Introduction:


Periodic  mass  screening  of  asymptomatic  women  is  rapidly  gaining  approval  and 
acceptance,  and  the  population  segment  recommended  for  screening  is  increasing  due  to 
increasing  compliance,  longer  life  expectancy,  and  earlier  recommended  age  for  initial 
examination  [1-3].  The  large  variability  in  a  number  of  important  aspects  related  to 
mammography,  as  practiced  in  the  U.S.,  resulted  in  the  enactment  of  the  Mammography 
Quality  Standards  Act,  which  mandates  accreditation  of  each  program  (facility,  technical, 
and  professional)  [4,5].  Shortages  of  expert  mammographers  in  many  locations,  combined 
with  the  desire  to  make  it  convenient  for  the  patient  to  undergo  the  procedure,  suggest  that 
there  may  be  a  need  for  high-quality  tele-mammography  systems  that  enable  a  distributed 
acquisition-centralized  expert  review  type  solution  to  the  problem,  particularly  in 
underserved  areas  [6,  7].  The  relatively  high  recall  rates  (5-15%)  of  screened  women  to 
supplement  information  that  was  not  ascertained  during  the  initial  visit  (e.g.  magnification 
views,  ultrasound)  also  make  it  desirable  to  enable  physician  “monitoring”  and 
“management”  of  remote  underserved  locations  so  that  some  patient-management  decisions 
can  be  made  while  the  patient  remains  in  the  clinic  [8-11].  In  addition,  a  technologist  who 
observes  a  possible  abnormality  during  the  performance  of  the  study  could  benefit  from  the 
ability  to  communicate  her/his  suspicion,  and  an  expert  mammographer  could  review  the 
specific  case,  together  with  the  technologist’s  observation,  resulting  in  an  improved  and 
perhaps  a  more  timely  diagnosis.  Current  practices  result  in  increased  patient  anxiety  and 
added practice complexity and cost. Even in urban practices, recommendations for recall are not always followed by the woman, and eliminating the need to return to the clinic through implementation of this concept, particularly in remote locations, could increase overall compliance. Early attempts to develop and implement a practical tele-mammography solution to this problem failed due to several significant technical problems
associated  with  acquisition,  transmission,  management,  and  display  of  the  images  and  other 
related  information  [12-14].  Many  of  these  technical  issues  have  been  resolved  in  recent 
years,  but  some  remain  [14-18].  Although  an  adequate  communication  infrastructure  for 
high-quality  tele-mammography  is  available  within  some  urban  regions,  the  fact  remains  that 
where  it  may  be  needed  most  (i.e.  remote,  non-urban  locations),  enabling  (two-way) 
communication  systems  remain  limited  to  lower  level  communication  capabilities.  Other 
communication  technologies,  such  as  satellites,  are  being  evaluated  for  this  purpose,  but  it  is 
not  likely  that  these  will  displace  lower  level  communication  technologies  in  many 
underserved  areas  for  quite  some  time  [19-23].  Hence,  the  problem  of  cost  effective,  timely 
remote  patient  monitoring  and  management  in  many  underserved  areas  is  not  a  simple  one. 

As  a  part  of  this  project,  we  assembled  and  evaluated  a  unique  tele-mammography 
system  that  enables  improved  communication  between  remote  sites  where  physicians  are  not 
always  available  during  the  mammographic  acquisition  process  and  a  central  location  where 
experts  can  review  the  acquired  images  shortly  after  acquisition  and  assess  whether  or  not 
additional  procedures  (e.g.,  spot  compression,  magnification  views,  ultrasound)  are  needed. 
The  system  was  designed  based  on  prior  preliminary  experience  acquired  in  our  group  during 
ten  years  of  research  in  this  general  area  [24,  25].  It  includes  the  use  of  a  common  carrier  for 
communication  (Plain  Old  Telephone  System,  POTS)  and  other  “low  level”  communication 
capabilities,  wavelet-based  image  compression  for  data  reduction,  and  the  optional 
incorporation  into  the  transmitted  information  of  other  text  information  such  as  location  of 
suspected abnormality, and CAD results. The main goal was to assess, in a step-by-step, clinically simulated approach, whether the use of such a system could potentially reduce recall rates at the remote sites. We also measured actual practice parameters in a large academic screening mammography practice.
Last,  ways  to  improve  communication  between  the  technologist  at  the  remote  site  and  a 
radiologist  at  the  central  site,  as  well  as  creating  an  environment  for  “more  active” 
participation  of  the  technologist  in  the  diagnostic  process,  were  also  explored. 

Body: 

Since  the  initiation  of  the  project  on  September  1,  2000,  we  have  been  executing  step 
by  step  the  tasks  listed  in  the  Statement  of  Work,  as  originally  submitted.  As  will  be 
explained  in  the  body  of  this  final  report,  our  initial  findings  resulted  in  the  addition  of 
several  technical  and  observational  tasks  that  were  successfully  performed  in  order  to 
maximize  our  ability  to  learn  about  the  practical  applications  being  investigated  in  this 
project. As an example, during year four of the project, a significant new capability was added to the system as a result of our previous observations from the clinically simulated experiments. We incorporated the ability to submit (transmit from the
remote  sites)  both  prior  mammograms  (when  available)  as  an  integral  part  of  the  examination 
to  be  evaluated  as  well  as  an  interactive  overlay  drawing  of  the  examination  with  location(s) 
marking  of  suspected  abnormality(s)  to  be  reviewed  by  a  radiologist.  This  required  a 
substantial  technical  effort  and  ultimately  resulted  in  a  major  software  upgrade  of  the  system. 
Hence,  substantial  parts  of  the  last  clinically  simulated  experiment  were  performed  during  a 
one  year  no-cost  extension  of  the  project. 

Under  Task  1,  we  performed  the  following: 

All subtasks listed under this task were completed. We assembled and tested a multi-site tele-mammography system that met (and in several respects exceeded) our originally
proposed  specifications.  The  status  of  the  tasks  described  under  this  category  is  as  follows: 

a)  Select  and  Purchase  Equipment:  During  year  one  of  the  project,  we  purchased 
and  tested  a  significant  amount  of  equipment  in  support  of  the  project  that  was 
funded  mainly  from  other  sources.  This  includes,  but  is  not  limited  to,  computers, 
laser  printers  and  film  digitizers.  During  the  selection  phase,  we  performed  a 
comprehensive side-by-side evaluation of the VIDAR and Lumisys film digitizers to assess whether or not the CCD-based VIDAR digitizer could be used for this
purpose.  Our  assessment  resulted  in  confirmation  that  the  Lumisys  film  digitizer 
was  significantly  more  robust  and  that  the  signal-to-noise  ratio  at  high  frequencies 
is  significantly  higher.  In  addition,  the  new  digitizer  raises  the  maximum  optical 
density  to  ~3.8,  which  is  a  significant  advantage  over  the  older  versions.  As  a 
result,  we  purchased  three  additional  digitizers  for  the  performance  of  this  project. 
We  also  acquired  (at  no  cost  to  the  project)  a  Kodak  8600  model  laser  printer, 
tested  it,  and  developed  an  interface  to  control  it  remotely. 

b) Convert Software to Windows Based: The general design of the tele-mammography project was reconsidered, and software was written for the Windows NT operating system to enable significantly more flexibility for the different applications that could be implemented. This task was completed, and refinements were made after initial testing.
All communication tasks have also been tested using the modified software.

c) Develop Interface to FFDM Acquisition System: We designed and developed an
interface  on  the  system  to  accept  DICOM  images  that  were  acquired  on  FFDM  systems.  We 
transferred  FFDM  images  to  the  server  and  displayed  these  on  the  workstation  at  the  central 
site  in  addition  to  providing  printing  capability.  All  the  functionality  specifications  were 
tested, but our clinical practice postponed the transition to FFDM-based screening (as we reported in years 1-3 of the project). Hence, other than enabling the tele-mammography system to accept FFDM-based examinations (in a manner compatible with all other clinical requirements), all of our clinically simulated studies throughout the project were film based. Our own transition to a fully digital system is underway and will be completed by next September (2006). However, we are a large academic center, and it is not at all clear that the
use  of  FFDM  devices  in  remote  “underserved”  sites  for  screening  purposes  is  likely  to  be 
common  or  appropriate  in  the  near  future. 

d)  Develop  a  New  User  Interface  for  the  Acquisition  Sites:  A  remote  site  user 
interface  was  completed  and  tested,  both  subjectively  (by  staff  members  and  technologists) 
and  objectively  (by  sending  over  100  cases  through  the  system).  After  minor  modifications 
that  were  based  on  users’  comments,  our  data  entry  and  case-sending  routines  were  finalized. 

e)  Complete  Data  Compression  Software  Module:  A  compression  software  scheme 
compatible  with  JPEG  2000  was  finalized  and  tested.  The  scheme  allowed  for  a  site-specific 
selectable  level  of  compression  to  be  used.  However  after  the  initial  testing  was  completed, 
we  fixed  the  compression  levels  for  all  sites  (see  below).  In  addition  to  the  data  compression 
module,  the  approach  we  incorporated  in  the  system  includes  a  comprehensive  tissue 
segmentation  routine  followed  by  a  wavelet  transform  and  “dialable”  data  compression 
module.  This  segmentation  routine  enables  a  very  efficient  data  reduction  by  eliminating 
non-tissue  regions  of  the  image  without  any  loss  of  information. 

f) Develop and Refine Measures of Image Fidelity that can be used to Automatically Monitor and Adjust (if needed) Compression Levels on an Image-by-Image Basis: Based on two independent tests (see evaluation section below), at two
compression  levels,  50:1  and  75:1,  we  enabled  a  “dial-up”  compression  capability  in  the 
system.  However,  we  also  found  out  that  the  physicians’  high  level  of  acceptance  of  either 
compression  level  practically  eliminated  the  need  for  this  “selectable”  option.  Therefore,  we 
proceeded  using  the  system  with  a  fixed  level  of  compression  (75:1),  at  all  sites. 

g)  Integrate  all  Software  Modules:  All  software  modules  were  successfully  integrated. 

h)  Develop  Display  Protocols  for  the  Workstation:  User-friendly  display  protocols 
were  developed  and  tested  extensively  (see  system  evaluation  section). 

i)  Assemble  System:  The  system  was  assembled  as  proposed. 

j)  Test  System  in  Laboratory:  The  system  was  tested  extensively  in  the  laboratory. 

k)  Trouble-Shoot,  Refine,  and  Finalize  System:  Through  refinements,  we  increased 
the  operational  ease-of-use  and  reliability  of  the  system  and  finalized  the  base  configuration 
for implementation and installation at remote sites.

l) Prepare Clinical Sites for Implementation: All three remote sites were prepared for
system  implementation  as  required. 
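The tissue segmentation step described under item e) can be illustrated with a minimal sketch. The report does not specify the segmentation algorithm, so the approach below (a fixed intensity threshold followed by a crop to the bounding box of the tissue) is purely an assumption, as are the names `tissue_bounding_box`, `crop`, `pixel_reduction`, and the toy image. It also assumes that tissue pixels are brighter than the background.

```python
from typing import List, Tuple

Image = List[List[int]]


def tissue_bounding_box(image: Image, background_max: int) -> Tuple[int, int, int, int]:
    """Return (top, bottom, left, right) of the smallest rectangle holding every
    pixel brighter than background_max; bottom and right are exclusive."""
    rows = [r for r, row in enumerate(image) if any(p > background_max for p in row)]
    cols = [c for c in range(len(image[0]))
            if any(row[c] > background_max for row in image)]
    if not rows:
        return (0, 0, 0, 0)  # no tissue found
    return (rows[0], rows[-1] + 1, cols[0], cols[-1] + 1)


def crop(image: Image, box: Tuple[int, int, int, int]) -> Image:
    """Keep only the pixels inside the bounding box."""
    top, bottom, left, right = box
    return [row[left:right] for row in image[top:bottom]]


def pixel_reduction(image: Image, cropped: Image) -> float:
    """Fraction of the original pixels that no longer need to be encoded."""
    original = len(image) * len(image[0])
    kept = sum(len(row) for row in cropped)
    return 1.0 - kept / original


# Hypothetical 6 x 8 "image": zeros are background, larger values are tissue.
toy = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [9, 7, 0, 0, 0, 0, 0, 0],
    [9, 8, 6, 0, 0, 0, 0, 0],
    [9, 8, 5, 0, 0, 0, 0, 0],
    [9, 6, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
box = tissue_bounding_box(toy, background_max=2)
cropped = crop(toy, box)
print(box, pixel_reduction(toy, cropped))
```

Because every above-threshold pixel stays inside the crop, no tissue information is discarded in this sketch; only background is dropped before any wavelet stage would run.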

Additional  development  efforts: 

As  our  step-by-step  clinically  simulated  experiments  progressed,  we  kept  adding 
functionality to the system. These changes required modifications that enabled integrated, easy-to-use functionality, and in year four of the project we decided to add prior images (examinations) to the case folder. Ultimately, we enabled the following tools on the system:

1)  text  messaging  (namely,  two-way  “chat”  between  the  remote  technologist  and  central 
radiologists),  2)  marking  of  suspicious  locations  (namely,  the  technologist  marks  suspicious 
regions  on  an  image  overlay),  3)  CAD  results,  4)  prior  mammography  reports,  and  5)  prior 
images  (when  available).  The  reason  for  the  additional  tools  was  to  provide  the  radiologist  at 
the  central  location  with  all  possible  tools  to  enable  better  assessment  of  the  examinations 
being  sent  for  review  (obviously,  this  is  all  done  in  addition  to  the  actual  mammographic 
images  in  question).  Hence  the  last  task  (“high  volume  clinically  simulated  demonstration 
and  evaluation”)  was  performed  with  all  tools  available  to  the  technologists  (at  the  remote 
sites)  and  the  physicians  (at  the  central  site).  A  major  upgrade  was  installed  and  tested  for 
this  purpose  in  year  four  of  the  project. 


Under  Task  2,  we  performed  the  following: 

a)  All  needed  equipment  was  moved  to  the  appropriate  locations  at  the  three  remote 
sites.  At  each  location,  the  equipment  (send  station  and  digitizer)  was  located  at  an  easily 
accessible  place.  At  the  central  site,  we  placed  the  “receive”  workstation  in  a  “screening” 
reading  room  at  a  central  location  within  our  Breast  Center.  This  required  some  construction 
that  was  completed  at  no  cost  to  the  project. 

b)  The  complete  system  was  reassembled  on  location  at  all  sites. 

c)  Technical  and  operational  performance  levels  were  retested  on  site. 

d)  Different  evaluation  protocols  for  initial  system  evaluations  were  developed  and 
implemented. 

1)  100  cases  were  randomly  selected  at  each  site  and  transmitted  to  a  central  site 
to  assess  ease-of-use,  reliability,  reproducibility,  and  cycle  times.  The  results  clearly 
indicate  that  cases  from  all  sites  at  15,  20,  and  90  miles  away  can  be  transmitted  with 
a  full  duty  cycle  time  (from  data  entry  at  remote  site  to  display)  that  easily  meets  our 
proposed  specifications.  A  four-image  case  can  be  completed  in  less  than  seven 
minutes  using  75:1  compression,  which  is  less  than  half  the  time  we  originally 
specified. 

2)  We  performed  a  multi-reader  subjective  assessment  of  image  quality,  and  all 
participating  radiologists  rated  the  quality  as  acceptable  or  better  for  the  task  at  hand. 

3) We evaluated differences in image quality on film and soft display at zero (no), 50:1, and 75:1 compression ratios and found that the 75:1 level can be identified (recognized) only under extreme magnification, and that image quality is not significantly degraded for all practical purposes.

4)  The  design  considerations  and  initial  testing  were  published  in  comprehensive 
SPIE  reports  (see  references  3  and  8  in  the  “Reportable  Outcomes”  section). 
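The cycle-time result in item 1) above depends heavily on compression, which a rough transmission-time estimate makes concrete. Only the 50 µm pixel size, the four-image case, and the 75:1 ratio come from this report. The film size (18 x 24 cm), the 12 bits per pixel, and the 33.6 kbit/s line rate are illustrative assumptions, and the function name `compressed_transfer_seconds` is hypothetical. The estimate covers transmission only, not digitization, processing, or display, so it is not a reproduction of the measured cycle time.

```python
def compressed_transfer_seconds(
    width_mm: int = 180,               # assumed film width (18 cm)
    height_mm: int = 240,              # assumed film height (24 cm)
    pixel_um: int = 50,                # pixel size stated in the report
    bits_per_pixel: int = 12,          # assumed digitizer bit depth
    images_per_case: int = 4,          # four-image case, as in the report
    compression_ratio: float = 75.0,   # fixed ratio used at all sites
    line_rate_bps: float = 33_600.0,   # assumed POTS modem throughput
) -> float:
    """Estimate the time needed to send one case over the line, in seconds."""
    cols = width_mm * 1000 // pixel_um
    rows = height_mm * 1000 // pixel_um
    raw_bits = cols * rows * bits_per_pixel * images_per_case
    return raw_bits / compression_ratio / line_rate_bps


with_compression = compressed_transfer_seconds()
without_compression = compressed_transfer_seconds(compression_ratio=1.0)
print(f"75:1 compression: {with_compression / 60:.1f} min")
print(f"no compression:   {without_compression / 3600:.1f} h")
```

Under these assumptions the uncompressed case would occupy the line for hours, whereas the compressed case takes minutes, which is why a fixed high compression ratio matters on low-level communication links.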


Under  Task  3,  we  performed  the  following: 

1)  Retrospective  observer  performance  studies: 

A  step  by  step  assessment  protocol  was  implemented  in  this  project.  Four  independent 
observer  performance  studies  were  performed  as  information  provided  to  the  reader  at  the 
remote  site  was  incrementally  increased.  In  the  first  study,  306  examinations  of  all  types 
(without  a  rigorous  selection  process)  were  sent  from  all  sites  and  were  read  at  the  central  site 
by  5  radiologists.  The  study  included  a  large  number  of  cases  that  were  (and  some  that  were 
not)  suspected  by  the  technologists  at  the  remote  site  as  possibly  needing  additional 
procedures. The results (see Table 1) suggested that a large number of additional procedures would have to be performed at the remote sites in order to reduce recall rates (a ratio of approximately 3:1; 433 additional procedures would have been needed to reduce recalls by 151).


Table 1. All readers combined (same cases)

                        clinical read
study read         recall   no recall   total
recall                151         433     584
no recall              59         887     946
total                 210        1320    1530

overall agreement = 1038
prob-observed     = 0.678
prob-expected     = 0.586
Kappa             = 0.224
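The agreement statistics listed with the table above can be recomputed from its four cell counts. The sketch below uses the standard definition of Cohen's kappa for a 2 x 2 table; the function name `cohen_kappa` and its argument names are illustrative and do not come from the original analysis software.

```python
from typing import Tuple


def cohen_kappa(both_recall: int, study_only: int,
                clinical_only: int, neither: int) -> Tuple[float, float, float]:
    """Return (observed agreement, chance-expected agreement, kappa)
    for a 2 x 2 table of study read versus clinical read."""
    total = both_recall + study_only + clinical_only + neither
    p_observed = (both_recall + neither) / total

    study_recall = both_recall + study_only        # row total: study read "recall"
    clinical_recall = both_recall + clinical_only  # column total: clinical "recall"
    p_expected = (
        study_recall * clinical_recall
        + (total - study_recall) * (total - clinical_recall)
    ) / total ** 2

    kappa = (p_observed - p_expected) / (1.0 - p_expected)
    return p_observed, p_expected, kappa


# Cell counts for all readers combined in the first study.
p_obs, p_exp, kappa = cohen_kappa(both_recall=151, study_only=433,
                                  clinical_only=59, neither=887)
print(round(p_obs, 3), round(p_exp, 3), round(kappa, 3))
```

The same function can be applied to the cell counts of any later study-versus-clinical comparison in this report.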

As  a  result,  we  performed  three  additional  observer  performance  studies  while 
gradually  increasing  the  amount  of  information  transmitted  to  the  central  site.  In  all  three 
studies  cases  were  specifically  selected  by  the  technologists  when  they  felt  during  the  QA 
review  of  the  examinations  in  question  that  the  women  would  likely  be  recalled  by  the 
radiologist  for  additional  procedures. 

A  synopsis  of  the  three  observer  performance  studies  in  this  area  follows:  registered, 
experienced  mammography  technologists  from  three  remote  imaging  sites  transmitted  245 
screening  mammography  exams  to  a  central  site  (radiologists),  which  they  (the  technologists) 
believed  needed  additional  procedures.  Four  data  components  are  transmitted  from  the 
remote site: (1) image data - current exam mammography films digitized at 50 µm pixel dimensions; (2) text and graphic communication between the technologist and the radiologist
via  a  “chat”  box  in  which  the  technologist  can  describe  and  mark  suspicious  regions  on 
integrated  generic  images;  (3)  prior  patient  reports  when  available;  and  (4)  computer-aided 
detection  (CAD)  results.  At  the  central  site  images  are  displayed  on  a  workstation  consisting 
of  three  high-resolution,  portrait  monitors.  The  image  data  with  the  CAD  results  overlaid  are 
displayed  on  two  monitors  and  the  chat  box  and  prior  reports  on  the  third  monitor.  Seven 
radiologists  reviewed  and  rated  the  exams  on  the  tele-mammography  workstation  and 
indicated:  (1)  if  additional  procedures  were  recommended,  (2)  when  appropriate,  which 
breast  was  involved,  and  (3)  when  appropriate,  the  specific  recommended  procedures.  The 
performance  of  the  radiologists  on  the  workstation  was  compared  with  the  actual  clinical 
interpretation  of  the  same  examinations.  Study  1  had  two  interpretation  modes:  (1)  images 
only  and  (2)  images  and  technologist’s  text  message.  Study  2  had  two  modes:  (1)  images 
and  technologist’s  text  message  and  (2)  images,  text  message,  and  prior  report.  Study  3  had 
three  modes:  (1)  images,  technologist’s  text  message,  and  prior  report;  (2)  images,  text 
message,  prior  report,  and  technologist’s  graphic  location  marks;  and  (3)  images,  text 
message,  prior  report,  graphic  marks  (location),  and  CAD  results.  Amongst  other  analyses, 
we  computed  the  potential  improvements  in  terms  of  projected  reduction  in  recall  rates  at  the 
remote  sites  and  associated  “costs”  in  terms  of  “unnecessary”  additional  procedures. 

Results:  Technologists  were  able  to  identify  suspicious  examinations  that  may 
require  additional  procedures,  but  their  “recommended”  examinations  amounted  to  a 
substantially  larger  number  compared  with  that  of  a  clinical  interpretation  by  a  radiologist. 
The  screening  exams  were  successfully  transmitted,  processed,  reviewed,  and  rated.  The 
percent  of  exams  recalled  for  recommended  additional  procedures  (termed  “recall”)  during 
the  actual  clinical  interpretation  for  Studies  1  (n  =  130),  2  (n  =  99),  and  3  (n  =  115)  were 
39.2%, 38.4%, and 42.2%, respectively. In tele-mammography Study 1, modes 1 and 2 had mean recall rates of 73.3% (+/- 17.9) and 82.5% (+/- 16.2), respectively, and mean agreements of 51.7% (+/- 5.5) and 48.7% (+/- 6.3), respectively. In Study 2, modes 1 and 2 had mean recall rates of 79.6% (+/- 12.3) and 77.5% (+/- 13.8), respectively, and mean agreements of 52.3% (+/- 6.7) and 52.8% (+/- 7.0), respectively. In Study 3, modes 1, 2, and 3 had mean recall rates of 72.3% (+/- 9.3), 72.3% (+/- 9.3), and 72.7% (+/- 9.2), respectively, and mean agreements of 57.4% (+/- 4.6), 57.1% (+/- 3.9), and 56.7% (+/- 3.9), respectively.
However,  it  should  be  remembered  that  without  radiologists’  reviews  100  percent  of  these 
women  were  “recommended”  for  additional  procedures  by  the  technologists. 

In  these  studies,  we  demonstrated  that  between  70  and  85  percent  of  recalls  (as 
ultimately  were  decided  during  the  clinical  interpretation)  could  have  been  avoided,  albeit  at 
a high “cost” of performing additional procedures on these women. As we increased the information provided to the radiologist from “text message” alone, to text message, prior reports, and a location overlay, to all of the above plus CAD results, the number of “unnecessary” procedures recommended by the radiologists decreased progressively from 1.45 (246/183) to 1.26 (171/136) to 1.07 (216/202) per “saved” recall. As can be seen in the last task, this number was further reduced to 0.94 (81/86) during the last experiment (see task #5 below).

2)  Clinical  assessment  of  performance  levels: 

a. There are several aspects of the task that are worth noting. First, we were “breaking ground” in several respects that include, but are not limited to, the involvement of technologists in the decision-making process (namely, which cases to send over to the central site and why), and possibly the increased “reliance” of the radiologists on the
technologists’  judgments.  Our  subjective  assessments  of  this  issue  clearly  indicated 
that  both  radiologists  and  technologists  welcomed  increased  communication.  We 
often heard comments like “the technologists often identify abnormalities before we do and sometimes see things we do not”. However, our practice is a very established
one  and  it  is  not  clear  that  this  “reliance”  and  “trust”  would  exist  in  other  practices. 

b.  As  a  part  of  this  investigation,  we  assessed  our  clinical  performance  levels  in  the 
traditional  practice  (without  tele-mammography).  We  analyzed  data  available  in  our 
databases  concerning  patient  distributions  and  process-related  information.  This 
includes,  but  is  not  limited  to,  the  recall  rate  by  physician,  site,  type,  and  reason  for 
recall.  Our  recall  rates  and  cancer  detection  rates  were  found  to  be  very  stable  for  the 
group  of  radiologists  as  a  whole  and  individual  radiologists  as  well. 

c.  We  also  reviewed  records  concerning  the  cycle  time  from  the  initial  examination  to  a 
definitive  diagnosis  for  cases  that  were  not  being  recalled,  as  well  as  cases  that  were. 
One  of  the  more  interesting  (and  relevant)  findings  in  this  regard  was  the  long  cycle 
time  including  scheduling  (average  was  >  20  days  at  the  time)  between  the  patient’s 
call  for  an  appointment  due  to  a  recall  and  the  actual  date  of  examination.  This 
highlights  the  potential  benefit  of  the  use  of  tele-mammography  to  reduce  recall  rates 
(hence  cycle  time),  in  particular  in  busy  practices  such  as  ours.  This  information, 
which  is  now  reviewed  monthly,  generated  a  significant  effort  in  our  system 
(including,  but  not  limited  to,  performing  diagnostic  sessions  during  the  weekends) 
and  we  have  been  able  to  make  substantial  improvements  in  this  regard. 

d.  During  the  project  period,  we  completed  a  large  study  to  assess  the  effect  of  the 
introduction  of  CAD  into  our  clinical  environment  and  the  relationship  between  recall 
rates  and  detection  rates  for  our  ten  highest  volume  radiologists.  One  of  the  important 
issues  that  was  raised  in  our  group  was  the  correlation  (if  any)  between  the  recall  and 
detection  rates  of  radiologists.  This  is  an  important  point  since  there  is  a  significant 
pressure  on  radiologists  to  reduce  their  individual  recall  rates  to  below  ten  percent. 
While  we  recognize  the  tremendous  value  of  reducing  recall  rates  without  a 
substantial  degradation  in  detection  rates  (sensitivity),  the  question  arises  as  to 
whether  or  not  higher  recall  rates  are  also  generally  associated  with  higher  detection 
rates.  These  studies  involved  the  reviews  of  over  115,000  records  and  resulted  in 
important  observations  that  were  published  in  JNCI  and  Cancer  (see  publications 
list).  We  strongly  believe  that  the  use  of  CAD  will  ultimately  be  an  integral  part  of 
the  diagnostic  process  and  some  of  our  continuing  efforts  to  develop  and  improve 
CAD  schemes  were  supported  (only  to  a  very  minimal  level)  by  this  project  (see 
publications  list). 

e. As indicated above, we were not able to assess practice parameters for screening mammography using FFDM because in our system FFDM had been used largely in diagnostic procedures (rather than screening).

Under Task 4, we performed the following:


a.  CAD  Software  Module: 

A CAD software module was designed and written specifically for the tele-mammography system. This module differs from our other CAD development efforts in that it is very flexible, and any new CAD development in our research projects can be easily incorporated into the system.

b.  CAD  Incorporation: 

During the third year of the project we completed the design, implementation, and testing of the modular set of software routines that enable the incorporation of CAD into the tele-mammography system at the remote (sending) sites prior to compressing the images and transmitting the results to the central site.

c.  CAD  Technical  Performance  Evaluation: 

The  system  was  tested  technically  using  over  100  cases,  and  after  de-bugging,  we 
incorporated  the  final  module  into  the  operations.  In  the  last  two  years,  all  transmitted  cases 
were  processed  by  the  CAD  scheme  and  could  be  displayed  on  the  workstation  with  and 
without  the  CAD  results  at  the  operator’s  discretion.  The  technical  performance 
specifications  of  the  system  were  not  violated  due  to  the  incorporation  of  CAD  because  the 
program  is  faster  than  the  digitization  process  and  is  done  in  parallel  to  all  other  tasks  while 
the  case  (examination)  is  being  processed. 

d.  CAD  Operational  and  Clinical  Use: 

The  operational  use  of  CAD  results  was  tested  using  a  retrospective  clinical  review  and 
found  acceptable.  The  clinical  aspects  of  this  added  feature  were  evaluated  in  an  observer 
performance  study. 

e.  Performance  Analyses: 

The  impact  of  the  use  of  CAD  on  radiologists’  ability  to  make  better  interpretations  in  regard 
to the need for additional procedures in specific cases was assessed. The result of this effort was a reduction of ~18% (1.26/1.07) in the recommended procedures per “saved” recall, as described under task 3. As a result of this study, all cases transmitted to the central site in task #5 included the CAD results as well (see below).


Under  Task  5,  Clinically  simulated  almost  real  time  transmission  and  reporting: 

Under  this  task  we  performed  several  pilot  studies  throughout  the  project  in  preparation 
for  a  simulated  clinical  study  that  took  place  during  the  fifth  year  (no  cost  extension).  As 
indicated,  the  reason  for  the  delay  was  the  finding  that  remotely  determined 
recommendations  for  additional  procedures  would  remain  high  unless  we  transmit  the 
prior  examinations  (when  available)  together  with  the  current  examination  of  interest. 
Once  the  system  upgrade  was  completed,  the  performance  of  an  “almost  real  time  -  high 
volume”  demonstration  of  the  transmission  of  suspected  cases  at  the  remote  sites  and  a 
clinically  simulated  response  from  the  central  site  commenced.  During  a  period  of  five 
months  when  over  4,000  screening  examinations  were  performed  at  the  three  remote 
sites, we asked the technologists to identify, during their QA procedures as they performed the screening procedure, all cases they believed would need to be recalled, and to send them with all available related information. During four real time experiments and nine simulated real time experiments, radiologists reviewed all the information sent from the remote sites and responded to the site. In the real time experiments, cases were sent simultaneously from all three sites; in the simulated experiments, cases were sent as they became available at each site. Per our protocol, the recommendations of the
radiologists  were  not  acted  upon  but  were  stored  and  compared  with  the  actual  clinical 
recommendations  for  the  same  examination  at  the  clinic.  353  cases  were  sent  and 
reviewed  by  radiologists  in  this  experiment  and  the  results  are  summarized  below. 


All readers

                        clinical read
study read         recall   no recall   total
recall                 86          81     167
no recall              36         150     186
total                 122         231     353

overall agreement = 236
prob-observed     = 0.6686
prob-expected     = 0.5083
Kappa             = 0.3259

The details of this effort are being written up for publication, but in essence the
results indicated that: 1) The provision of the prior examinations improved performance
significantly as compared with our previous studies. 2) At the cost of 81 additional
procedures performed while the women remained in the remote clinic, and assuming the
clinical read would ultimately result in the 36 recalls that were not recommended
remotely, 86 of the 122 actual recalls (70.5%) could have been avoided. This finding is
important for remote underserved locations, and the decision of whether such a practice
is acceptable will depend largely on the nature of the practice at the remote site.
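
The agreement statistics reported with the table can be recomputed from its four cell
counts. The following is a minimal sketch only: the function name cohen_kappa and its
argument names are our own illustrative choices and are not part of the project software.

```python
def cohen_kappa(both_recall, study_only, clinical_only, neither):
    """Cohen's kappa for a 2 x 2 agreement table (illustrative helper).

    both_recall   : cases recalled by both the study read and the clinical read
    study_only    : cases recalled by the study read only
    clinical_only : cases recalled by the clinical read only
    neither       : cases recalled by neither read
    """
    n = both_recall + study_only + clinical_only + neither
    p_observed = (both_recall + neither) / n
    study_recall = both_recall + study_only
    clinical_recall = both_recall + clinical_only
    # Chance agreement: both say "recall" or both say "no recall" by chance.
    p_expected = (study_recall * clinical_recall
                  + (n - study_recall) * (n - clinical_recall)) / n ** 2
    kappa = (p_observed - p_expected) / (1 - p_expected)
    return p_observed, p_expected, kappa


# Cell counts taken from the table: 86, 81, 36, and 150 (353 cases in total).
p_obs, p_exp, kappa = cohen_kappa(86, 81, 36, 150)
print(f"observed={p_obs:.4f} expected={p_exp:.4f} kappa={kappa:.4f}")

# Fraction of the 122 clinical recalls that the study read also flagged.
print(f"avoidable recalls: {86 / 122:.1%}")
```

With the counts above, the sketch gives values that agree with the reported observed
agreement, expected agreement, and kappa to four decimal places.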

During  the  five  observer  performance  studies  alone,  we  performed  a  total  of  5440  clinically 
simulated  interpretations  on  the  tele-mammography  workstation  that  were  each  compared 
with  the  actual  recommendations  made  during  the  clinical  interpretations  of  the  same 
examination. 


Key  Research  Accomplishments: 

During  the  last  five  years,  we  have  been  progressing  according  to  the  original  plan 
and  were  able  to  address  a  large  number  of  the  technical,  operational  and  practice  related 
issues  associated  with  the  design,  implementation,  and  clinically  simulated  testing  of  the 
multi-site  tele-mammography  system.  The  key  accomplishments  were: 
• We designed, developed, implemented, installed and tested a unique, multi-site
tele-mammography system that meets (and in many areas exceeds) the technical
specifications we originally anticipated (and proposed).

•  We  successfully  and  reliably  transmitted  over  2,800  examinations  from  three  remote 
sites  to  the  central  site  (with  minimal  down  time  and  technical  problems).  This  set 
includes  2,432  examinations  that  were  used  in  the  different  studies  and  the  remainder 
were  examinations  sent  (sometimes  multiple  times)  for  system  testing  purposes  and 
deleted  after  completion  of  the  test. 

•  We  planned  and  executed  a  step-by-step  comprehensive,  technical,  and  clinical 
assessment  protocol  in  a  clinically  simulated  environment. 

•  We  have  been  able  to  coherently  engage  a  large  team  of  administrative,  technical, 
clinical  (i.e.,  technologist),  and  physician  personnel  in  a  large  and  complicated 
project. 

•  We  carried  out  comprehensive  reviews  of  the  practice  parameters  and  performance 
levels  of  our  radiologists  in  terms  of  recall  and  cancer  detection  rates  with  and 
without  the  use  of  CAD. 

•  We  continually  upgraded  the  system  as  needed  with  a  major  software  revision  in 
response  to  radiologists’  preferences  during  the  performance  of  the  specific  task  the 
tele-mammography  system  was  designed  for. 

•  We  successfully  reviewed  a  large  number  of  cases  on  the  workstation  and  generated  a 
clinically  simulated  response  to  the  remote  sites. 

•  We  completed  five  observer  performance  studies  to  assess  both  possible  utility  of  the 
system  as  well  as  agreement  levels  between  the  technologists  and  radiologists  on 
suspicious  cases. 

•  We  have  been  able  to  increase  the  communication  level  between  technologists  and 
physicians  in  regard  to  decision-making  processes,  and  we  are  engaged  in  discussions 
concerning  a  more  extensive  use  of  technologists  as  physician  extenders  in  several 
areas. 

• We demonstrated that in principle one can perform remote management tasks effectively
and efficiently and achieve a significant reduction in actual recall rates, with a
relatively  limited  increase  in  the  number  of  women  who  would  receive  additional 
procedures  during  their  initial  screening  visit.  This  concept  can  be  implemented  in  a 
manner  that  only  minimally  affects  workflow  in  a  busy  clinical  environment. 

Reportable  Outcomes: 

1)  Publications  and  Presentations 

As we developed and tested the system, several reports were generated. Some are
directly related to the design, implementation, and testing of the system and others are
related to practice assessment tasks that were performed. The clinically simulated study
that was performed during the last year is being analyzed, and we are in the process of
writing a comprehensive article on this topic. We anticipate submission of this article
before the end of the year (2005). Published reports acknowledging this award, to date,
include:

2) Other reportable outcomes:

The effort supported by this project helped us generate preliminary data and, perhaps
more important, several concepts that were used in support of three grant applications
that are currently funded:

1.  “Rule  Based  CAD  of  Digitized  Mammograms”,  PI:  David  Gur,  source: 
NIH,  grant  #  CA077850 

2.  “Interactive  CAD  for  Mammography”,  PI:  Bin  Zheng,  source:  NIH,  grant 
#  CA101733 

3.  “The  Laboratory  Effect  in  Breast  Cancer  Detection  Studies”,  PI:  David 
Gur,  source:  NIH,  grant  #  EB003503 

Personnel  receiving  pay  from  this  effort: 
David Gur, Sc.D., Joseph K. Leader, Ph.D., Glenn S. Maitz, M.S., Yuan-Hsiang
Chang,  Ph.D.,  Howard  E.  Rockette,  Ph.D.,  Jules  Sumkin,  D.O.,  Xiao  Hui  Wang, 
Ph.D.,  Bin  Zheng,  Ph.D.,  John  M.  Drescher,  B.S.,  Amy  H.  Klym,  B.S.,  Jennifer  S. 
Stalder,  B.S.,  Christopher  Traylor 

None of the radiologists and technologists participating in the observer performance
studies of this effort were paid directly from the grant. Payments were made to the
Department of Radiology for the services rendered by the radiologists and
technologists.

Last, one of our investigators during the first three years of the project (Y-H Chang) has
returned to Taiwan, where he is employed as a faculty member in the Department of
Electrical Engineering. Two other investigators (Drs. Joseph Leader and Xiao Hui Wang)
were promoted during the project to Research Assistant Professor of Radiology.


Conclusions: 

A  comprehensive  multi-task,  multi-discipline  applied  project  that  involved  a  large 
team  of  investigators,  physicians  and  staff  was  successfully  executed  and  completed.  We 
undertook  a  large  number  of  technical  and  application-based  tasks  associated  with  the  design, 
implementation,  and  clinically  simulated  evaluations  of  a  multi-site  tele-mammography 
system.  We  modified  the  system  as  needed  and  exceeded  several  of  the  performance  goals 
we originally proposed. The concept of remote management of screening practices where a
physician is not present was tested using a comprehensive step-by-step evaluation and was
shown to be feasible; it could also result in improved communication between technologists
and physicians at the remote and central sites. Our main observation to date is that the
general  concept  is  sound  and  the  actual  implementation  resulted  in  an  appreciation  for  the 
importance  of  the  “comfort  level”  of  the  team  (physicians  and  technologists)  in  operating  and 
using  such  a  system  for  the  stated  purpose.  Most  important  perhaps  is  the  demonstration  that 
in  principle,  using  this  (or  a  similar)  approach,  one  could  achieve  a  significant  reduction  in 
actual recall rates for a second visit. At this time, this can be done only at some cost,
namely an increase in the number of women who would receive additional procedures (e.g.,
views) during their initial screening visit. Last, we have substantially improved our
understanding of several extremely important issues related to the general practice of
screening mammography (e.g., the relationship between recall rates and cancer detection
rates), and the use of CAD in particular. These may have far-reaching implications for
this field.

So  What? 

The  issues  associated  with  efficient  and  efficacious  mammographic  screening  in 
general  and  in  remote  underserved  locations  in  particular  are  significant.  The  main  goal  of 
this  project  was  to  evaluate  how  the  use  of  an  “almost  real-time”  tele-mammography  system 
(with  or  without  the  use  of  relevant  information)  may  impact  the  diagnostic  process  in  terms 
of complete cycle time and patients’ recall rate. Our success in this project has already
substantially changed our own thinking about practice issues in remote sites, and we hope
others will follow. We demonstrated different ways to increase communication between
remote (and potentially underserved) sites and a central site. Our hope is that by using the
concepts we investigated, one may be able to provide better, more timely and cost-effective
service at these sites and, in the process, substantially reduce actual recall rates in remote
facilities where a physician is not present. Despite significant advances in our understanding
of  the  issues  and  alternatives  surrounding  “optimal”  practices  of  screening  mammography, 
many  of  our  current  clinical  practice  guidelines  are  based  on  limited  subjective  assessments 
and  anecdotal  experiences,  and  a  significant  fraction  is  related  to  operational  matters  in  busy 
urban  environments  that  are  staffed  by  experienced  radiologists.  The  area  of  optimizing 
remote, underserved practices has been studied only in a cursory manner. Our project is but
one attempt to improve our understanding of the technical, operational, and clinical issues
facing these facilities and to implement technology-based solutions that may help them
provide a better service to the populations they serve. Our own institution is basing its
transition strategy to a digital environment in screening mammography partially on the
observations made during this project (albeit using PACS-enabled remote management
rather than tele-mammography), and we believe others should consider this or a similar
approach as well.

Soft-Copy Mammographic Readings with Different Computer-assisted Detection
Cuing Environments: Preliminary Findings


PURPOSE: To assess the performance of radiologists in the detection of masses and
microcalcification clusters on digitized mammograms by using different
computer-assisted detection (CAD) cuing environments.

MATERIALS AND METHODS: Two hundred nine digitized mammograms depicting 57 verified
masses and 38 microcalcification clusters in 85 positive and 35 negative cases were
interpreted independently by seven radiologists using five display modes. Except for the
first mode, for which no CAD results were provided, suspicious regions identified with a
CAD scheme were cued in all the other modes by using a combination of two cuing
sensitivities (90% and 50%) and two false-positive rates (0.5 and 2.0 per image). A
receiver operating characteristic study was performed by using soft-copy images.

RESULTS: CAD cuing at 90% sensitivity and a rate of 0.5 false-positive region per
image improved observer performance levels significantly (P < .01). As accuracy of
CAD cuing decreased so did observer performances (P < .01). Cuing specificity
affected mass detection more significantly, while cuing sensitivity affected detection
of microcalcification clusters more significantly (P < .01). Reduction of cuing
sensitivity and specificity significantly increased false-negative rates in noncued areas
(P < .05). Trends were consistent for all observers.

CONCLUSION: CAD systems have the potential to significantly improve diagnostic
performance in mammography. However, poorly performing schemes could adversely
affect observer performance in both cued and noncued areas.


Breast  cancer  is  one  of  the  leading  causes  of  death  in  women  over  the  age  of  40  years  (1,2). 
To  reduce  mortality  and  morbidity  with  early  diagnosis  and  treatment,  current  guidelines 
recommend  periodic  mammography  screening  for  women  aged  40  and  over  (3).  Due  to 
the  large  number  of  mammographies  performed  and  the  low  yield  of  abnormalities 
detected in screening environments, detecting abnormalities (mainly masses and
microcalcification clusters) from the background of a complex normal anatomy is a tedious,
difficult,  and  time-consuming  task  for  most  radiologists  (4,5). 

Hence,  there  is  a  growing  interest  in  the  development  of  computer-assisted  detection 
(CAD)  schemes  for  mammography.  It  is  generally  believed  that  such  schemes  could 
eventually provide radiologists with a valuable "second opinion" and help improve
accuracy and efficiency of breast cancer detection at an early stage (6,7).

To assess the potential for improving diagnostic accuracy and efficiency in
mammography, several studies have been performed by using the CAD systems. These studies
have demonstrated that with the appropriate assistance of CAD systems, radiologists could
either detect more subtle cancers in a screening environment (8,9) or increase the accuracy
of distinguishing malignant lesions from those that are benign (10-12). While some
authors (13-15) indicated that CAD did not substantially decrease the specificity levels of
the radiologists, others (16,17) indicated that current CAD systems could significantly
decrease diagnostic accuracy and efficiency of radiologists due to high false-positive
detection rates. As there is difficulty in comparing the performance of different CAD
schemes developed at various institutions (18), the results of these studies are not easily
comparable, since different CAD schemes, radiologists, and cases were included. Authors
of these studies did not address in detail how CAD could affect the diagnostic
performance of the observers or the level of CAD that may be required to be widely
acceptable as a helpful tool in the clinical environment.

Researchers have suggested that large-scale experiments are needed to assess the effect
of CAD (eg, the false-positive identifications) on the diagnostic accuracy of radiologists
(19). Some doubt remains as to whether CAD systems might increase the number of
unnecessary follow-up examinations or biopsies and thereby offset the benefits from the
potential gains in sensitivity (20).

The effect of precuing images (highlighting suspicious areas) has been of great interest
in the field of perception psychology in general (21,22) and of diagnostic radiology in
particular (23-25). Much of the work was associated with attempts to improve tumor
detection on x-ray images of the chest. In a series of carefully designed experiments,
Krupinski et al (26) demonstrated that in a cued environment, performance of radiologists
in detecting true-positive lung nodules that had not been cued was degraded substantially.
The shapes of abnormalities (ie, masses and microcalcification clusters) and the
complexity of the background tissue seen on mammograms are somewhat different from those
of lung nodules and the surrounding background lung parenchyma. Therefore, it is not
clear how CAD cuing may affect the performance of radiologists in mammography.

The purpose of our study was to assess the performance of radiologists in the detection
of masses and microcalcification clusters on digitized mammograms in a CAD environment
after modulating cuing sensitivity levels and false-positive rates.


MATERIALS  AND  METHODS 

Seven board-certified radiologists (including M.A.G., C.A.B., C.M.H., L.A.H., T.S.C.)
with a minimum of 3 years of experience in the interpretation of mammograms participated
in this observer performance study. None of the seven observers had participated in the
case selection process. All images used in this study were selected from a large and
diverse image database established at Magee-Womens Hospital, with institutional review
board approval and exemption of patient consent. The original database contained
mammograms that were collected mainly from several thousand patients undergoing routine
mammographic screening at three medical centers (27).

All positive masses were verified at biopsy. All negative cases were rated by
radiologists according to the level of concern by using standard Breast Imaging Reporting
and Data System, or BI-RADS, recommendations. The negative cases had been diagnosed during
at least two subsequent follow-up examinations. Although we routinely acquire four images
in a single examination (two views of each breast), for some cases in our digitized
database, we have only two images of one breast due to a variety of clinical reasons. By
using an established digitization protocol, all mammograms were digitized with a
laser-film digitizer (Lumisys, Sunnyvale, Calif), with a pixel size of 100 × 100 µm and
12-bit digital-value resolution. The quality of the digitizer was monitored routinely to
ensure that in the optical density range of 0.2-3.2, digital values were linearly
proportional to optical densities (28).

The selection of subtle or difficult cases included several steps. First, we selected a
large set of positive cases (200 in this experiment) for which the output scores generated
by the CAD scheme were low for the likelihood that the abnormality in question was present
(27). Similarly, we used a set of suspicious negative cases (80 in this experiment) for
which CAD scores were high for the likelihood that a mass or a cluster of
microcalcifications or both were present. Then, two experienced observers pruned the data
set by means of visual inspection on the same display as that used in the study with the
"true diagnosis" to select the final 120 cases. The total number of positive cases was
selected to include a reasonable mix of benign and malignant cases of single and multiple
abnormalities, with a minimum of 25 malignant cases of each of the abnormalities.

The resources that were required, in terms of radiologist effort (reading time), were a
factor in limiting the number of cases to 120 and the reading modes to five. In 85 cases,
mammograms depicted either masses or clusters of microcalcifications or both, and 35 cases
were negative for these abnormalities. In 10 of the positive cases, both a mass and a
microcalcification cluster were depicted. In all other positive cases, only one
abnormality (either a mass or a cluster) was depicted. Hence, the positive cases consisted
of 38 verified microcalcification clusters and 57 verified masses. Biopsy results
indicated that 27 of the clusters and 39 of the masses were malignant, while the remaining
11 clusters and 18 masses were benign. Since we were interested in the detection (not
classification) of abnormalities, cases were selected on the basis of subtleness of the
depicted abnormality, and no attempt was made to balance the number of benign and
malignant cases in the dataset. Although study findings suggested that to preserve subtle
microcalcifications, mammograms should be digitized with pixel sizes of 50 × 50 µm or
less (15,29), all microcalcification clusters in this study were detectable with our CAD
scheme. In addition, we verified that all clusters were visible on images that were
digitized with 100 × 100 µm pixel size.

In this study, radiologists were asked to detect masses and microcalcification clusters
on digitized mammograms displayed on a monitor. In most of the 120 cases (n = 89), two
contralateral images (the same view of left and right breasts) were displayed on the
monitor side by side. For some cases (n = 31), only a single image was displayed. The
latter group was selected from the cases in our database for which we have only two views
of one breast. Hence, only one view was displayed in this study, following our study
protocol. Table 1 summarizes by type and verified finding the distribution of the
abnormalities depicted in the 120 cases. The observers interpreted each case only on the
basis of the images displayed on the monitor. No images from previous examinations or
other clinical information about the patients was made available during the
interpretation.

Each radiologist interpreted the same 120 cases five times by using five display modes.
Suspicious regions, as identified with our CAD schemes, were cued on the images in all
modes, with the exception of the first mode, in which no CAD results were provided to the
radiologists. Two true-positive cuing sensitivity levels (90% and 50%) and two
false-positive cuing rates (0.5 or 2.0 per image) were used in these four cuing modes
(Table 2). During the cuing modes, when a new case was loaded into the display,
radiologists viewed the cued images first. Then they could remove the prompts from the
display or add them back at their discretion.

TABLE 1
Number of Mammographic Cases in Different Categories

                             No. of Micro-   No. of
                 No. of      calcification   Masses and    No. of Negative   Total
Cases            Masses      Clusters        Clusters      Cases             Cases
                 M     B     M     B         M     B
Single-image     10    1     11    3         1     1       4                 31
Two-image        20    16    7     7         8     0       31                89
Total            30    17    18    10        9     1       35                120

Note: B = benign, M = malignant.

TABLE 2
CAD Cuing Conditions of the Five Display Modes

Reading Mode   CAD Cuing   Cuing Sensitivity   Cuing False-Positive Rate
1              No          Not applicable      Not applicable
2              Yes         0.9                 0.5
3              Yes         0.9                 2.0
4              Yes         0.5                 0.5
5              Yes         0.5                 2.0

To generate the cues, CAD schemes developed by our group (27) were applied to these 209
images (or 120 cases). The schemes use filtering, subtraction, and topographic region
growth algorithms to identify suspicious regions, including masses and microcalcification
clusters (30,31). Then, by using nonlinear multilayer multifeature analyses, two
artificial neural networks, which have been optimized in our previous studies and reported
before (32), were used to classify each region as positive or negative for the presence of
an abnormality in question. One network was designed to assess regions suspicious for
masses, and the other was for microcalcification clusters. Before applying the artificial
neural networks, the schemes initially identified 133 suspicious regions for
microcalcification clusters and 831 for masses. Of the 133 clusters, 38 represented true
clusters and 95 were false identifications (or a rate of 0.45 [95 of 209 mammograms]
false-positive detections per image). Of the 831 mass regions, 57 were true-positive and
774 were false-positive (or 3.7 per image, or 774 of 209 mammograms). The artificial
neural networks were then applied to classify all of these regions. Each suspicious region
received a likelihood score (from 0 to 1) for being positive. The larger the score, the
more likely the region was to represent a true-positive region.

Selection of true-positive and false-positive cues for each display mode was performed
separately. Two cuing sensitivities (90% and 50%) were applied to masses and
microcalcification clusters. Each abnormality was assigned a number (eg, 1-57 for masses
or 1-38 for clusters). A computer program randomly selected the regions to be cued until
the required number was reached for the sensitivity level being evaluated. In display
modes 2 and 3, with the cuing sensitivity set at 90%, 51 of 57 true masses and 34 of 38
clusters were selected. In modes 4 and 5, with the cuing sensitivity set at 50%, 29 of 57
masses and 19 of 38 clusters were selected. Two false-positive cuing rates (approximately
0.5 and 2.0 false-positive regions per image) were used. Because the number of
false-positive clusters identified with the scheme was 95, all of these regions were used
in display modes 3 and 5, which provided a false-positive cuing rate of 0.45 (95 of 209
mammograms). In modes 2 and 4, the total false-positive desired cuing rate was 0.5 per
image, which was one-fourth of that in modes 3 and 5. Hence, one-fourth of the available
false-positive clusters (24 of 95) were selected on the basis of artificial neural
network-generated scores, with the 24 highest scoring regions being selected in descending
order and resulting in a cuing rate of 0.11 (24 in 209 mammograms).
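
The true-positive cue counts quoted above can be reproduced with a short sketch. This is
illustrative only and is not the study's program: the function names n_cued and
select_cued are hypothetical, and the rounding rule (nearest integer, halves rounded up)
is our assumption, chosen because it matches the reported counts.

```python
import math
import random


def n_cued(n_true, sensitivity):
    """Number of true abnormalities to cue at a target cuing sensitivity.

    Assumption: round to the nearest integer, with halves rounded up.
    """
    return math.floor(n_true * sensitivity + 0.5)


def select_cued(n_true, sensitivity, seed=None):
    """Randomly pick which numbered abnormalities (1..n_true) receive a cue."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, n_true + 1), n_cued(n_true, sensitivity)))


for label, n_true in (("masses", 57), ("clusters", 38)):
    for sensitivity in (0.9, 0.5):
        count = n_cued(n_true, sensitivity)
        print(f"{label}: cue {count} of {n_true} at {sensitivity:.0%}")
```

Under the stated rounding assumption, the sketch gives 51 and 29 of 57 masses and 34 and
19 of 38 clusters, the same counts as in the text.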

To reach the overall target of 0.5 and 2.0 false-positive cuing rates per image
(including both mass and microcalcification cluster regions), 774 false-positive mass
regions were also sorted on the basis of the artificial neural network-generated scores.
Then, 82 of the highest scoring false-positive regions were selected from the list for
display in modes 2 and 4, and 324 false-positive masses were selected for display in modes
3 and 5. Thus, the false-positive cuing rates for mass only were 0.39 (82 in 209
mammograms) and 1.55 (324 in 209 mammograms) per image, respectively. In summary, modes 2
and 4 included 106 false-positive cues (or 0.5 per image), and modes 3 and 5 included 419
false-positive cues (or two per image).
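
The false-positive cue totals and per-image rates follow directly from the counts given
above. The sketch below is a simple arithmetic check; the names FALSE_POSITIVE_CUES,
cues_per_image, and highest_scoring are hypothetical, and highest_scoring only illustrates
the idea of taking the top-scoring regions from a sorted list, not the actual scheme.

```python
N_IMAGES = 209  # digitized mammograms in the study

# False-positive cue counts quoted in the text, by display mode pair.
FALSE_POSITIVE_CUES = {
    "modes 2 and 4": {"clusters": 24, "masses": 82},
    "modes 3 and 5": {"clusters": 95, "masses": 324},
}


def cues_per_image(n_cues, n_images=N_IMAGES):
    """Average number of cues per image."""
    return n_cues / n_images


def highest_scoring(scores, k):
    """Return the k largest likelihood scores in descending order."""
    return sorted(scores, reverse=True)[:k]


for modes, counts in FALSE_POSITIVE_CUES.items():
    total = counts["clusters"] + counts["masses"]
    print(f"{modes}: {total} cues, {cues_per_image(total):.2f} per image")
```

The totals are 106 and 419 cues, that is, roughly 0.5 and 2.0 false-positive cues per
image, in line with the target rates.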

Each of the 20 reading sessions for individual observers included 30 randomly selected
cases that used one reading mode. To eliminate the potential for learning effects, the
order of display modes (or cuing rates) for each observer was preselected by using a
counterbalanced approach. The 20 sessions were divided into four blocks, with five
sessions each. In each block, one observer read five sessions with five different modes in
random order. However, at each session number in the series (eg, session 6), at least five
observers read with different modes, and no more than two readers read with the same mode.
For example, in the first session for all the observers, observers started reading with
different modes. Because there were seven observers and five display modes, observers 1-5
read with modes 1-5, respectively, while observer 6 read with mode 3 and observer 7 read
with mode 2. Last, a study management program was used to randomly select the cases and
their sequential order in each session. The random "seed" used in the program was date
dependent. Because each observer had a different reading schedule, the cases selected in
each session (eg, session 4) and their sequential order for each observer were different.
A minimum time delay (10 days) between the two consecutive readings of the same case was
implemented.
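
The counterbalancing constraint described above can be expressed as a small check. This
is a sketch under our reading of the text (every one of the five modes is in use at a
given session number, and no mode is used by more than two observers); the function name
session_is_counterbalanced is hypothetical and is not taken from the study software.

```python
from collections import Counter

N_CASES, N_MODES = 120, 5
N_SESSIONS, CASES_PER_SESSION = 20, 30


def session_is_counterbalanced(modes, n_modes=N_MODES, max_per_mode=2):
    """Check the modes read by all observers at one session number.

    modes: one display mode per observer, e.g. seven entries for seven observers.
    """
    counts = Counter(modes)
    return len(counts) == n_modes and max(counts.values()) <= max_per_mode


# First-session assignment given in the text: observers 1-5 read modes 1-5,
# observer 6 read mode 3, and observer 7 read mode 2.
first_session = [1, 2, 3, 4, 5, 3, 2]
print(session_is_counterbalanced(first_session))

# 20 sessions of 30 cases cover each of the 120 cases once in each of 5 modes.
print(N_SESSIONS * CASES_PER_SESSION == N_CASES * N_MODES)
```

Both checks hold for the first-session assignment and the session counts given in the
text.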

A standard landscape workstation (Sparc 20; Sun Microsystems, Mountain View, Calif) was
used to display the images. Images were not preprocessed, but we did optimize the contrast
of each image by means of window and level manipulation for optimal visual display. The
image parameters were then fixed. The observers could not manipulate the contrast and
brightness settings during the readings. Initially, images were displayed on the screen as
subsampled (ie, at low spatial resolution) to fit the screen (with approximately 1,200 ×
850 pixels). With zoom and roam functions, the radiologists were able to view the images
at full spatial resolution by clicking the appropriate control button or scroll bars. A
"Display/Remove" button could be used to superimpose or delete the CAD cues on the images.
Radiologists could make diagnostic decisions while viewing either subsampled or
full-spatial-resolution images.

Observers were asked to perform and score two separate tasks. First, they were asked to
identify (detect) suspicious areas for the presence of an abnormality and then classify
the suspected abnormality as benign or malignant. Once a radiologist pointed to and
clicked the cursor on the center of a suspected abnormality, a scoring window appeared,
followed by a confidence-level sliding scale. The program automatically recorded all of
the diagnostic information entered by the radiologist, including the type of detected
abnormality (mass or microcalcification cluster), location (the center of the detected
region), and two estimated likelihood scores (from 0 to 1) for the detection (presence or
absence) and classification (benign or malignant) of any identified region that was
suspected of an abnormality. The likelihood scores were used to generate the free-response
receiver operating characteristic curves.

The  results  of  each  observer,  abnormal¬ 
ity,  and  display  mode  were  qualitatively 
viewed,  and  free-response  receiver  oper¬ 
ating  characteristic  curves  were  plotted 
for  individual  readers  and  modes,  as  well 
as  for  pooled  confidence  ratings  for  all 
readers  since  their  general  patterns  were 
consistent.  For  testing  the  hypothesis  of 
equality  of  the  free-response  receiver  op¬ 
erating  characteristic  curves  (or  the  de¬ 
tection  sensitivities  at  the  same  false-pos¬ 
itive  rates)  across  four  CAD  cuing  modes, 
we  compared  sensitivities  among  the 
curves  at  10  false-positive  rates  that  were 
uniformly  distributed  over  the  measured 
range.  Sensitivity  levels  across  modalities 
were  compared  by  using  a  repeated  mea¬ 
sures  logistic  regression  model,  where  the 
binary  outcome  variable  was  replicated 
over  patients,  and  the  independent  vari¬ 
ables  included  reader  and  modality.  Esti¬ 
mation  was  done  by  using  a  Generalized 
Estimating  Equation  approach  (33). 
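
The comparison at 10 uniformly distributed false-positive rates can be sketched as follows. The text does not state how sensitivities were read off the curves between measured points, so linear interpolation is an assumption here, as are the function names and the toy curve; the repeated measures regression itself is not reproduced.

```python
def sensitivity_at(curve, fp_rate):
    """Linearly interpolate sensitivity at a given false-positive rate.

    curve: list of (fp_per_image, sensitivity) points sorted by fp_per_image.
    Rates outside the measured range are clamped to the nearest end point.
    """
    if fp_rate <= curve[0][0]:
        return curve[0][1]
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        if x0 <= fp_rate <= x1:
            if x1 == x0:
                return max(y0, y1)
            return y0 + (y1 - y0) * (fp_rate - x0) / (x1 - x0)
    return curve[-1][1]


def sample_curve(curve, lo, hi, n=10):
    """Sample sensitivity at n uniformly spaced false-positive rates in [lo, hi]."""
    step = (hi - lo) / (n - 1)
    rates = [lo + i * step for i in range(n)]
    return [(rate, sensitivity_at(curve, rate)) for rate in rates]


# Hypothetical curve for one mode (illustrative values only).
toy_curve = [(0.0, 0.2), (0.5, 0.5), (1.0, 0.7)]
samples = sample_curve(toy_curve, lo=0.1, hi=1.0)
```

Sampling every mode at the same false-positive rates is what allows sensitivities to be compared across modes as paired observations.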

In addition, we analyzed the changes
in performance indices (ie, the number of
missed true-positive regions in the cued
or noncued areas) for the two sensitivity
levels (50% and 90%) and the two false-positive
cuing rates (0.5 and 2.0 per image).
The hypotheses of the equality of
the number of missed abnormalities were
also tested by using a repeated measures
logistic regression, with reader and modality
in the model. To examine potential
biases for reading the same case five
times, the reading results were reordered
and analyzed for all cases that were read
the first time (regardless of mode) as one
group and the second time as another
group, and so on. Performance curves
were computed separately for these five
mutually exclusive groups and were compared
by using the analysis of variance
test.

Figure 1. Free-response receiver operating characteristic curves for the average detection of
mammographic abnormalities (including both masses and microcalcification clusters) by seven
participating radiologists using five display modes. O = mode 1, ■ = mode 2, ▲ = mode 3, * =
mode 4, and ♦ = mode 5.

RESULTS 


Performance  curves  varied  among  ob¬ 
servers,  but  the  general  pattern  was  con¬ 
sistent.  Figures  1-3  demonstrate  curves  of 
the  average  performance  of  the  seven 
observers  for  the  detection  of  either  ab¬ 
normality,  masses,  or  microcalcification 
clusters,  respectively.  As  can  be  noted 
from  the  noncued  results  (mode  1),  the 
task  in  general  was  challenging  because 
of  the  display  environment,  the  subtlety 
of  the  abnormalities,  or  both. 

Figure 1 demonstrates that both sensitivity
and specificity of the CAD results
affected observer performance. The differences
among modes 2-5 were highly
significant (P < .01). However, the results
showed different patterns for the detection
of masses compared with microcalcifications.
In the case of masses (Fig 2),
specificity of the CAD results (or cuing
false-positive rate) affected the observers
in a more significant manner. The differences
among modalities were statistically
significant (P < .01), with the performance
decreasing as the number of cued
regions increased. In the case of clusters
(Fig 3), observer performance was affected
to a greater extent by the cuing
sensitivity. The combination of case subtlety
and viewing of soft copies rendered
the test of microcalcification cluster detection
so difficult that only approximately
60% were detected without cuing
or with cuing at low sensitivity (modes 4
and 5). With the support of highly sensitive
cues, the performance improved to a
detection rate of approximately 75% (P <
.01).

Highly  accurate  cuing  (ie,  90%  sensi¬ 
tivity  and  0.5  false-positive  cue  per  im¬ 
age)  helped  the  observers  to  improve 
their  performance,  compared  with  the 
noncued  environment  (P  <  .01).  As  the 
accuracy  of  the  cuing  decreased,  so  did 
the  performance  of  the  typical  observer. 
This  effect  continued  for  either  detection 
task,  but  the  detection  of  microcalcifica¬ 
tion  clusters  was  more  significantly  af¬ 
fected  by  sensitivity  of  the  cuing  in  our 
case. Most important, perhaps, our study
results clearly indicate that poorly performing
CAD (Fig 1) can result in significant
degradation of observer performance
(P < .01).

Figure 2. Free-response receiver operating characteristic curves for the average mass detection
by seven radiologists using five display modes. O = mode 1, ■ = mode 2, ▲ = mode 3, * = mode
4, and ♦ = mode 5. (Axis label: false-positive masses per image marked by the observers.)

Figure 3. Free-response receiver operating characteristic curves for the average microcalcification
cluster detection by seven radiologists using five display modes. O = mode 1, ■ = mode 2,
▲ = mode 3, * = mode 4, and ♦ = mode 5. (Axis label: false-positive clusters per image marked
by the observers.)


Table 3 demonstrates the number of
CAD-cued abnormalities that were identified
by each radiologist in mode 1 (noncuing)
but were missed in other (cued)
modes. Some increases in rejection rates
of true-positive regions were observed
when the number of cues increased, but
the results were not significant (P > .05).

Table  4  summarizes  the  number  of 
missed  abnormalities  in  noncued  areas 
during  CAD-cued  observations.  The  table 
data  show  that  for  the  highly  sensitive 
cuing  modes  (eg,  modes  2  and  3,  where 
only  10%  of  true-positive  regions  were 
not  cued),  the  majority  of  missed  abnor¬ 
malities  (>94%)  were  also  missed  in 
mode  1.  As  CAD  cuing  sensitivity  was 
reduced  to  50%,  the  average  number  of 
missed  abnormalities  in  noncued  areas 
increased  significantly  (P  <  .05).  More 
important,  approximately  30%  of  these 
regions  were  detected  by  the  radiologists 
in mode 1. The increase of the false-positive
cuing rate from 0.5 to 2.0 per image
(mode 4 vs mode 5, respectively) increased
the number of missed abnormalities
in noncued areas from an average of
14.4 to 18.0; this increase was not significant
(P = .16), most likely because of the small
sample size. In this case, the observers
also  missed  significantly  more  regions 
that  were  detected  in  mode  1  (P  =  .03).  In 
general,  the  number  of  missed  abnormal¬ 
ities  (false-negative  rate)  in  the  noncued 
areas  increases  as  the  cuing  sensitivity 
decreases  and  the  false-positive  cuing 
rate  increases.  As  a  result,  mode  5  had  the 
highest  miss  rate  in  noncued  areas. 
When  we  compared  detection  perfor¬ 
mances  for  benign  and  malignant  abnor¬ 
malities,  the  latter  group  was  somewhat 
better  detected  (probably  due  to  differ¬ 
ences  in  subtleness),  but  the  differences 
between  modes  were  similar  to  those  of 
the  benign  group. 

The  pooled  classification  confidence 
ratings  (malignant  vs  benign)  provided 
by  the  seven  observers  on  all  identified 
true-positive  regions  for  each  mode  were 
used  to  generate  and  compare  the  area 
under  the  receiver  operating  characteris¬ 
tic  curve  (Az)  values  for  the  different 
modes (rocfit; Metz CE, Herman BA,
Shen JH, University of Chicago, Ill) (34).
Az  values  were  estimated  by  using  maxi¬ 
mum  likelihood  estimation  under  the 
binormal  assumption.  The  Az  values  for 
the  classification  performance  over  all 
readers  were  0.70  ±  0.02,  0.69  ±  0.02, 
0.69  ±  0.02,  0.70  ±  0.02,  and  0.68  ±  0.02 
for  modes  1-5,  respectively.  Comparison 
of  each  pair  of  modes  did  not  result  in 
any  significant  differences  (P  >  .05). 
Hence,  once  the  abnormality  was  identi¬ 
fied  (detected),  the  ability  of  the  observer 
to  distinguish  between  benign  versus  ma¬ 
lignant  abnormalities  (classification)  was 
not  significantly  affected  (P  >  .05)  by  the 
cuing mode or lack thereof. Although
there were differences in performance
among the observers, we did not identify
any correlation of either the detection or
classification tasks with observer experience,
as measured by the number of years
of interpreting mammograms or the average
number of mammograms interpreted
per year. The performance trends
we observed were consistent for all observers.
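
For readers unfamiliar with the Az index, the sketch below shows two simple ways an area under the receiver operating characteristic curve can be computed. It is not the rocfit maximum likelihood procedure used in the study: the first function is a nonparametric (Wilcoxon-type) estimate swapped in for illustration, and the second evaluates the standard closed-form area of a binormal curve from an intercept `a` and slope `b` that are assumed to be already fitted. All names and values are hypothetical.

```python
import math


def empirical_auc(malignant_scores, benign_scores):
    """Nonparametric AUC: P(malignant score > benign score), ties count 1/2."""
    wins = 0.0
    for m in malignant_scores:
        for b in benign_scores:
            if m > b:
                wins += 1.0
            elif m == b:
                wins += 0.5
    return wins / (len(malignant_scores) * len(benign_scores))


def binormal_az(a, b):
    """Area under a binormal ROC curve with intercept a and slope b."""
    z = a / math.sqrt(1.0 + b * b)
    # Standard normal cumulative distribution evaluated at z.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Hypothetical classification confidence ratings (illustrative values only).
toy_malignant = [0.9, 0.7, 0.6, 0.4]
toy_benign = [0.8, 0.5, 0.3, 0.2]
auc = empirical_auc(toy_malignant, toy_benign)
```

An area of 0.5 corresponds to chance-level separation of malignant from benign regions, and 1.0 to perfect separation.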

The  minimum  time  delay  between  two 
consecutive  readings  of  the  same  case  by 
the  same  observer  was  set  at  10  days,  but 
the  actual  time  delay  ranged  from  12  to 
154  days,  with  an  average  time  delay  of 
48  days.  When  we  examined  the  results 
after  reordering  the  cases  by  their  order 
of  appearance  (ie,  first  time,  second  time, 
etc),  regardless  of  the  mode,  no  signifi¬ 
cant  ( P  >  .8)  difference  between  the 
groups  was  identified  (Fig  4).  Similar  per¬ 
formance  patterns  were  observed  when 
31  cases  that  included  only  one  image 
were  excluded  from  the  analyses,  and  the 
detection  results  were  not  significantly 
altered  in  any  comparison  between  those 
for  the  whole  group  (120  cases)  and  the 
subset  of  89  cases  containing  two  images 
(P  >  .5). 

DISCUSSION 


This  preliminary  study  has  to  be  clearly 
viewed  as  a  study  performed  under  labo¬ 
ratory  conditions.  Before  any  generaliza¬ 
tion  of  the  results  is  contemplated,  it  has 
to  be  considered  that  conditions  in  this 
study  were  removed  from  the  typical 
clinical environment. However, the consistency
of the patterns observed for the
individual readers and the group as a
whole warrants further assessment of the
effect of CAD performance on the observer.

Clearly, our findings do not support the
expectation that observers can readily and
easily discard most false-positive cues
regardless of their presentation or
prevalence (14). Both true- and false-positive
cues  affected  the  results.  The  effect  was 
also  dependent  on  the  type  of  abnormal¬ 
ity  and  its  subtleness  (detection  diffi¬ 
culty).  Despite  significant  reader,  case, 
and  mode  variability,  the  results  we  ob¬ 
tained  were  consistent  and  interpretable. 
As  expected,  at  low  specificity  levels,  all 
CAD-cued  modes  aid  in  increasing  sensi¬ 
tivity  of  observers,  as  can  be  seen  from 
the  tendency  to  cross  the  noncuing  per¬ 
formance  curve.  This  observation  is  con¬ 
sistent  with  some  of  the  results  previ¬ 
ously  reported  by  others,  but  it  may  not 
be  clinically  relevant  in  situations  in 
which  most  abnormalities  are  not  as  dif¬ 
ficult  to  detect  as  those  in  this  study. 


TABLE 3

Number of Missed Abnormalities Identified as Suspicious in Mode 1 (Noncued)
but Missed in Other Modes Despite the Fact that the Abnormality in Question
Was Cued

Reader     Mode 2    Mode 3    Mode 4    Mode 5
1          5         5         3         3
2          5         4         4         3
3          5         6         3         6
4          3         1         5         4
5          1         9         5         11
6          5         4         8         5
7          3         1         4         2
Average    3.9       4.3       4.6       4.9

TABLE 4

Number of Missed Abnormalities in Noncued Regions

Reader     Mode 2       Mode 3       Mode 4        Mode 5
1          5(D          5 0)         13 (3)        14 (5)
2          6 (0)        8 (0)        19 (2)        21 (7)
3          5 0)         5 (0)        11 (2)        15 (3)
4          5 (0)        6 (0)        19 (3)        25 (5)
5          6 (0)        4 (0)        10 (4)        13 (5)
6          7 (1)        7 (2)        14 (4)        20 (9)
7          6 (0)        5 (0)        15 (3)        18 (6)
Average    5.7 (0.4)    5.7 (0.4)    14.4 (3.0)    18.0 (5.7)

Note: Data in parentheses are the number of missed regions that were detected
in mode 1 (noncued).

Figure 4. Free-response receiver operating characteristic curves for the average detection of
abnormalities by seven radiologists as a function of the order of appearance: O = first time, ■ =
second time, ▲ = third time, * = fourth time, and ♦ = fifth time, regardless of the reading mode.
(Axis label: false-positives per image marked by the observers.)
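
The summary values reported for the two 50%-sensitivity modes in Table 4 can be rechecked with a few lines of Python. The per-reader counts below are transcribed from the mode 4 and mode 5 columns of the table; the function name and the pooled-fraction calculation are assumptions of this sketch rather than the authors' analysis.

```python
# Per-reader counts from Table 4, modes 4 and 5:
# (missed in noncued areas, of which detected in mode 1).
mode4 = [(13, 3), (19, 2), (11, 2), (19, 3), (10, 4), (14, 4), (15, 3)]
mode5 = [(14, 5), (21, 7), (15, 3), (25, 5), (13, 5), (20, 9), (18, 6)]


def summarize(rows):
    """Return reader averages and the pooled fraction detected in mode 1."""
    missed = sum(m for m, _ in rows)
    seen_in_mode1 = sum(d for _, d in rows)
    n_readers = len(rows)
    return {
        "mean_missed": round(missed / n_readers, 1),
        "mean_seen_in_mode1": round(seen_in_mode1 / n_readers, 1),
        "fraction_seen_in_mode1": seen_in_mode1 / missed,
    }


summary4 = summarize(mode4)
summary5 = summarize(mode5)
```

Under this pooled calculation the reader averages reproduce the tabulated 14.4 (3.0) and 18.0 (5.7), and the fraction of missed regions that had been detected in mode 1 comes to roughly 21% for mode 4 and 32% for mode 5.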


Our results suggest that the use of a
CAD-cued environment during the interpretation
of mammograms has to be
carefully investigated and fully understood
before it is widely accepted in a
routine clinical practice. In particular,
one should consider the cuing performance
level of the scheme itself and the
potential increase in missed abnormalities
in noncued regions, because the possible
liability associated with false-negative
interpretations far exceeds that of
false-positive readings (26).

The  general  consistency  of  our  results 
is  somewhat  surprising  in  view  of  the  fact 
that  cuing  rates  were  maintained  only  for 
short  durations  (within  a  single  session 
of  30  cases).  Unlike  the  display  environ¬ 
ment,  the  CAD  results  in  our  study  emu¬ 
lated  what  can  be  expected  by  using  cur¬ 
rent  levels  of  CAD  performances,  as  well 
as  what  one  hopes  to  achieve  by  using 
CAD in the future. The range of CAD
performances used for cuing, from
90% sensitivity at 0.5 false-positive identification
per image to 50% sensitivity at
two false-positive identifications per image,
enables an assessment of what could
be expected with improved CAD results.
It  is  interesting  to  note  that  for  all  display 
modes,  the  use  of  CAD  cuing  with  either 
high  or  low  performance  had  a  limited 
effect  on  observers  when  they  operated  at 
a  conservative  level.  Namely,  they  indi¬ 
cated  only  regions  they  were  confident 
about,  and,  therefore,  they  had  low  false¬ 
positive  rates.  This  stemmed  largely  from 
the  fact  that  the  CAD  cuing  depicted 
mainly  areas  on  the  image  that  were  truly 
appropriate  (reasonable)  as  suspicious.  As 
observers  loosened  their  criteria  (ie,  indi¬ 
cated  a  larger  number  of  suspicious  re¬ 
gions),  the  CAD-cuing  performance  af¬ 
fected  observers  in  a  more  significant 
manner.  Namely,  the  use  of  a  better  per¬ 
forming  cuing  scheme  significantly  im¬ 
proved  observer  performance,  while  the 
use  of  poorly  performing  cuing  schemes 
significantly  degraded  observer  perfor¬ 
mance. 

Analysis  of  the  data  sets  after  the  reor¬ 
der  of  cases  by  appearance  indicates  that 
learning  effects,  if  any,  were  not  a  signif¬ 
icant  factor  in  this  study.  Although  all 
selected  abnormalities  in  this  study  were 
detectable  with  CAD  schemes  and  visible 
on  displayed  images,  the  relatively  low 
detection  levels  of  the  seven  participat¬ 
ing  observers  in  the  case  of  subtle  clus¬ 
tered  microcalcifications  suggest  that  this 
task  is  likely  to  be  a  continuing  challenge 
when  soft  copy  is  used  for  this  purpose. 
We  are  not  aware  of  any  comprehensive 
study  in  which  this  issue  was  assessed, 
and  our  results,  albeit  preliminary,  sug¬ 
gest  that  such  a  study  should  be  per¬ 
formed. 

Despite the limited information (no
prior studies or reports and only a single
view for each breast) and the fact that
different abnormalities were detected in
each mode, the classification performances
of determining that an identified
abnormality was either benign or malignant
were reasonable and consistent. It
was  encouraging  to  learn  that  once  de¬ 
tected,  the  task  of  classifying  the  abnor¬ 
mality  as  benign  or  malignant  was  not 
affected  by  the  detection  cuing  perfor¬ 
mance,  which  points  to  the  fact  that 
these  are  likely  to  be  two  distinct  and 
largely  independent  tasks.  Our  CAD 
scheme  was  designed  solely  for  detection 
purposes.  Other  classification  schemes  (12) 
have  been  shown  to  perform  well,  and, 
when  used  during  interpretation,  signifi¬ 
cantly  improved  tissue  classification  per¬ 
formance  of  the  observers  (10,11). 

The  overall  detection  sensitivity  of  the 
radiologists  was  in  general  relatively  low 
compared  with  that  observed  in  the  clin¬ 
ical  environment.  This  may  be  due  to  the 
fact  that  most  of  the  cases  selected  for 
this  study  were  subtle,  and  reading  was 
performed  on  soft  copy  by  using  a  lim¬ 
ited  number  of  views  without  prior  ex¬ 
aminations  being  available  for  compari¬ 
son.  We  note  a  difference  between  this 
and  other  reported  studies  (14,15)  where 
observers  could  view  both  film  hard-copy 
images  and  low-spatial-resolution  soft- 
copy  images  with  CAD-cued  areas  on  the 
screen.  Not  providing  film  hard-copy  im¬ 
ages  to  the  observers  could  have  been  a 
significant  factor  in  lowering  detection 
sensitivity  in  this  study.  This  resulted  in  a 
crossing  of  the  performance  curves  for 
the  detection  of  microcalcifications  (Fig 
3),  since  the  noncued  mode  exhibited  a 
"capping"  effect  (an  imposed  upper 
limit)  that  was  removed  with  the  aid  of 
CAD  cuing.  This  does  not  invalidate  any 
of  the  analyses  or  observations  made  in 
this  study.  Despite  the  generally  low  level 
of  performance  and  the  high  prevalence 
of  abnormalities  in  our  data  set,  we  be¬ 
lieve  that  on  a  relative  scale,  the  results 
concerning  the  general  trends  we  ob¬ 
served  are  valid.  We  emphasize  that  our 
study  design  called  for  a  change  in  mode 
(hence,  abnormality  rates)  at  each  ses¬ 
sion.  The  effects  we  observed  under  these 
conditions  are  probably  different  and 
likely  minimized,  as  compared  with  those 
in  a  study  design  in  which  each  mode  is 
read  to  its  completion  before  any  prevalent 
changes  (ie,  change  to  a  different  mode). 

In  conclusion,  our  preliminary  study 
results  indicate  that  in  a  laboratory  envi¬ 
ronment,  observer  performance  in  the 
detection  of  subtle  mammographic  ab¬ 
normalities  is  significantly  affected  by 
the inherent performance of a cuing system.
High-performance cuing systems
can  significantly  improve  observer  per¬ 
formance.  On  the  other  hand,  low-per¬ 
formance  cuing  systems  can  significantly 
degrade  observer  performance.  These 
findings,  together  with  the  intermode 
consistency  we  observed,  are  important, 
since  there  could  be  diagnostic  implica¬ 
tions  associated  with  the  inappropriate 
use  of  or  reliance  on  CAD  results  during 
the  interpretation.  These  issues  have  to 
be  further  investigated  with  larger  data 
sets  and  a  more  closely  simulated  clinical 
environment.

Performance gain in computer-assisted detection schemes
by averaging scores generated from artificial neural networks
with adaptive filtering

The authors investigated a new method to optimize artificial neural networks (ANNs) with adaptive
filtering used in computer-assisted detection schemes in digitized mammograms and to assess
performance changes when averaging classification scores from three sets of optimized schemes.
Two independent training and testing image databases involving 978 and 830 digitized mammograms,
respectively, were used in this study. In the training data set, initial filtering and subtraction
resulted in the identification of 592 mass regions and 3790 suspicious, but actually negative regions.
These regions (including both true-positive and negative regions) were segmented into three subsets
three times based on the calculation of the values of three features as segmentation indices. The
indices were “mass” size multiplied by their digital value contrast, conspicuity, and circularity.
Nine ANN-based classifiers were separately optimized using a genetic algorithm for each subset of
regions. Each region was assigned three classification scores after applying the three adaptive
ANNs. The performance gain of the CAD scheme after averaging the three scores for each suspicious
region was tested using an independent data set and a ROC methodology. The experimental
results showed that the areas under ROC curves (Az) for the testing database using three sets of
optimized ANNs individually were 0.84±0.01, 0.83±0.01, and 0.84±0.01, respectively. The
between-index correlations of three Az values were 0.013, -0.007, and 0.086. Similar to averaging
diagnostic ratings from independent observers, by averaging three ANN-generated scores for each
testing region, the performance of the CAD scheme was significantly improved (p < 0.001) with Az
value of 0.95±0.01. © 2001 American Association of Physicists in Medicine.

I. INTRODUCTION

A  number  of  computer-assisted  detection  (CAD)  schemes 
have  been  developed  in  recent  years  to  detect  masses  and 
microcalcification  clusters  depicted  in  digitized 
mammograms.1-10  Many  researchers  believe  that  eventually 
these  CAD  schemes  will  help  radiologists  to  significantly 
improve  their  diagnostic  accuracy  and  efficiency  in  diagnos¬ 
ing  breast  cancers  at  an  earlier  stage.11-13  Others  question 
whether  the  high  false-positive  rates  resulting  from  the  CAD 
schemes  could  generate  a  large  number  of  unnecessary  re¬ 
calls  or  possibly  biopsies,  which  might  offset  the  possible 
gains  in  detection  sensitivity.14,15  Because  of  this  potential 
negative  effect  (i.e.,  high  false-positive  rate)  on  diagnostic 
performance,  significant  effort  has  been  invested  in  an  at¬ 
tempt  to  improve  CAD  performance.16-19  In  order  to  achieve 
high  detection  sensitivity,  CAD  schemes  typically  identify  a 
large  number  of  suspicious,  but  actually  negative  regions  at 
the  initial  detection  stage.  Hence,  an  important  task  in  CAD 
development  is  to  improve  accuracy  of  classifying  a  large 
number  of  identified  regions.  Previous  studies  in  this  area 
focused  mainly  on  searching  for  an  effective  classifier  in¬ 
cluding,  but  not  limited  to:  a  linear  discriminant  function,5  an 
improved  artificial  neural  network  (ANN),20  a  wavelet 


transformation,3  a  set  enumeration  decision  tree,21  a  Baye¬ 
sian  belief  network,22  and  a  knowledge-based  expert 
system.23  Other  efforts  concentrated  on  determining  a  small, 
but  optimal  set  of  features  that  include  morphological 
features,10  texture  features,16  and  derivative-based  features.4 

Because  of  the  complexity  and  large  variability  of  the 
abnormalities  in  question  and  the  surrounding  tissue  struc¬ 
tures,  it  is  quite  difficult  for  a  single  universal  scheme  to 
accurately  classify  suspicious  regions  using  a  limited  number 
of  correlated  features.24,25  To  address  this  problem,  two  ap¬ 
proaches  have  been  investigated  to  date.  The  first  one  is  to 
segment  the  images  or  suspicious  regions  into  different 
groups  based  on  specific  predetermined  image  characteristics 
(e.g.,  “image  difficulty  indices”)  and  then  optimize  separate 
schemes  with  adaptive  filtering  for  each  group  (class)  of  im¬ 
ages.  Previous  studies  using  this  approach  suggested  prom¬ 
ising  results  for  a  rule-based  CAD  scheme26  and  for  a 
wavelet-transform  based  CAD  scheme.27  The  second  ap¬ 
proach  that  has  been  explored  is  to  combine  (or  average)  the 
detection  results  from  different  noncorrelated  classifiers, 
such as the averaging of detection scores from rule-based
and ANN-based classifiers,17 or those of an ANN and a set
enumeration tree.21 Similar to improving diagnostic accuracy
by averaging ratings from replicated, but independent readings
or from different readers,28,29 averaging CAD scores
generated by different classifiers could also be an effective
approach to improve performance.17,21

Med. Phys. 28 (11), November 2001. 0094-2405/2001/28(11)/2302/7/$18.00. © 2001 Am. Assoc. Phys. Med.
Zheng et al.: Performance gain in computer-assisted detection schemes
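
The score-averaging idea described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the scheme developed in this study: the function name `average_scores`, the three hypothetical classifier outputs, and their values are invented for the example, and the scores are assumed to lie on a common 0 to 1 scale.

```python
def average_scores(score_sets):
    """Average per-region scores across classifiers.

    score_sets: list of equal-length score lists, one list per classifier,
    aligned so that index i always refers to the same suspicious region.
    """
    n_classifiers = len(score_sets)
    return [sum(region) / n_classifiers for region in zip(*score_sets)]


# Hypothetical scores from three classifiers for four suspicious regions.
ann_a = [0.9, 0.2, 0.6, 0.4]
ann_b = [0.7, 0.4, 0.3, 0.5]
ann_c = [0.8, 0.3, 0.6, 0.0]
combined = average_scores([ann_a, ann_b, ann_c])
```

If the individual scores carried independent noise of equal variance, averaging three of them would reduce the noise variance by a factor of three, which is the same rationale as averaging ratings from independent readers.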

In our previously reported studies,21,26 image databases
were  somewhat  limited  and  the  computation  of  the  indices 
by  which  images  were  segmented  into  groups  was  quite 
complicated.  In  the  present  study,  we  combine  the  two  ap¬ 
proaches.  In  addition,  we  use  three  image  features  that  are 
well  defined,  easily  computable,  and  widely  used  in  CAD 
schemes  to  segment  the  image  ensemble  into  different 
groups.  This  study  focuses  on  detecting  masses  in  digitized 
mammograms.  Since  studies  have  shown  that  high- 
performing  CAD  cueing  could  significantly  improve  the  per¬ 
formance  of  radiologists  in  detecting  subtle  cancers13,30-32 
and  our  study  suggested  that  once  detected,  the  task  of  clas¬ 
sifying  masses  as  benign  or  malignant  was  not  affected  by 
the  CAD  detection  performance,  we  assume  here  that  detec¬ 
tion  and  classification  are  two  distinct  and  largely  indepen¬ 
dent  tasks.32  A  detailed  description  of  the  development  phase 
of  the  scheme  and  the  initial  test  using  a  large  independent 
data  set  are  presented. 

II.  MATERIALS  AND  METHODS 
A.  Image  databases 

Two  independent  image  databases  were  used  in  this  study. 
The  first  database  (used  as  the  training  database)  contains  a 
total  of  978  digitized  mammograms.  Of  these,  545  images 
were  acquired  on  patients  who  underwent  mammographic 
examinations  at  the  University  of  Pittsburgh  Medical  Center 
(Pittsburgh,  PA)  and  its  affiliated  hospitals  and  clinics  prior 
to  April  1997,  and  433  images  were  provided  to  us  by  an 
imaging  research  group  at  Washington  University  Medical 
School  (St.  Louis,  MO).  A  detailed  description  of  this  data¬ 
base  has  been  reported  elsewhere.22  The  second  image  data¬ 
base  (used  as  the  testing  database)  contains  830  images,  of 
which  528  were  provided  to  us  by  a  research  and  develop¬ 
ment  team  at  the  Eastman  Kodak  Company  (Rochester, 
NY)10  and  302  images  collected  more  recently  (>10/98)  on 
patients undergoing mammography examinations at the University
of Pittsburgh Medical Center. Although the mammograms
originated in different medical facilities, these were
all digitized in our laboratory using a laser-film digitizer (Lumisys,
Sunnyvale, CA) with a pixel size of 100 μm
× 100 μm and 12 bit gray-level resolution. For mass detection,
the images were then subsampled (pixel digital value
average) by a factor of 4 in both directions to generate images
of approximately 600 × 450 pixels. All true-positive
masses depicted in these images were pathologically verified,
and the locations of the masses were marked on the
images by radiologists.
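
The subsampling step can be illustrated with a short Python sketch that averages non-overlapping 4 × 4 blocks of pixel values, so that 100 μm pixels become 400 μm pixels. The function name, the list-of-rows image layout, and the handling of incomplete edge blocks are assumptions of this sketch; the original implementation is not described at that level of detail.

```python
def subsample_by_averaging(image, factor=4):
    """Shrink a 2-D image by averaging non-overlapping factor x factor blocks.

    image: list of rows of pixel values. Trailing rows or columns that do not
    fill a complete block are dropped (an assumption of this sketch).
    """
    rows = len(image) // factor
    cols = len(image[0]) // factor
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            block = [
                image[r * factor + i][c * factor + j]
                for i in range(factor)
                for j in range(factor)
            ]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out


# An 8 x 8 toy image whose pixel value equals its row index.
toy = [[row for _ in range(8)] for row in range(8)]
small = subsample_by_averaging(toy)  # 2 x 2 result
```

Averaging, rather than simply keeping every fourth pixel, suppresses some digitizer noise at the cost of spatial resolution, which is generally acceptable for mass detection.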

Each image was processed by a multilayer topographic-based
CAD scheme previously developed in our laboratory.33
Using dual-kernel filtering, subtraction, and simple thresholding
methods, the scheme identifies a large number of suspicious mass
regions. A set of image features is then extracted from the
mammogram, and a classifier (i.e., artificial neural network)
is applied to classify each region as positive or negative.
In  brief,  this  scheme  has  three  distinct  stages  for  the  identi¬ 
fication  of  masses.  The  first  stage  of  dual  kernel  filtering, 
subtraction,  and  labeling  resulted  in  the  selection  of  a  large 
number of suspicious regions (24 067 and 19 154 regions
when  applied  to  the  two  image  databases,  respectively,  or 
approximately  24  regions  per  image).  Based  on  local  contrast 
measurements,  the  second  stage  used  an  adaptive  region 
growth  algorithm  to  define  three  topographic  layers  for  each 
suspicious  region.  For  each  growth  layer,  a  set  of  simple 
intralayer  boundary  conditions  on  region  growth  ratio  and 
shape  factor  was  applied  to  eliminate  a  large  number  of  ini¬ 
tial  suspicious  regions.  After  the  second  stage,  the  number  of 
suspicious  regions  (including  both  positive  and  negative  re¬ 
gions)  decreased  to  4382  and  3623  (or  approximately  4.4 
regions  per  image)  in  the  training  and  testing  databases.  For 
each  suspicious  region,  a  set  of  image  features  was  automati¬ 
cally  computed  by  the  scheme.  Using  these  features,  the  third 
stage  of  the  CAD  scheme  used  a  three-layer  feed-forward 
ANN  to  classify  these  regions  as  positive  or  negative  for 
mass.24 

The  second  stage  of  the  scheme  identified  592  and  358 
suspicious  regions  that  depicted  verified  masses  in  the  train¬ 
ing  and  testing  databases,  respectively.  With  the  exception  of 
these  regions  that  matched  verified  masses,  all  other  regions 
that  were  identified  as  suspicious  by  the  scheme  at  this  stage 
were  determined  to  be  negative.  A  total  of  3790  and  3265 
negative  regions  were  identified  as  suspicious  (or  false¬ 
positive)  in  the  training  and  testing  databases,  respectively. 
For  each  region,  36  image  features  inside  the  suspicious  re¬ 
gion  (including  its  three  topographic  growth  layers33)  and  its 
surrounding  background  were  automatically  computed  by 
the  CAD  scheme.  These  features  include  mainly  geometri¬ 
cally  related  features,  such  as  region  size,  circularity,  or  nor¬ 
malized  standard  deviation  of  radial  length  and  intensity- 
related  features  (or  distribution  of  pixel  values),  such  as 
contrast,  standard  deviation,  and  skewness  of  pixel  values’ 
distribution  and  conspicuity.  The  definitions  and  the  methods 
of  computation  for  these  features  have  been  reported  in  sev¬ 
eral  previous  studies.22,24  To  reduce  the  potential  redundancy 
and  improve  the  robustness  of  the  scheme,  we  used  a  genetic 
algorithm  (GA)  to  select  an  optimal  subset  of  input  features 
to  be  used  in  the  ANN. 

B.  Database  segmentation 

The  basic  concept  of  adaptive  filtering  is  to  divide  suspi¬ 
cious  regions  (or  images)  into  several  groups  based  on  a 
computable  index  and  then  to  optimize  different  ANNs  for 
the  regions  (or  images)  in  each  group.  Although  several  com¬ 
plicated  indices  have  been  used  for  segmentation  with  some 
success,26,27  we  searched  here  for  new  indices.  The  selection 
criteria  were:  (1)  the  index  was  easily  computable;  (2)  the 
index  had  been  used  as  a  feature  in  other  CAD  schemes;  and 
(3)  the  relationship  between  the  index  and  the  segmentation 
results  is  “interpretable”  and  has  been  demonstrated  in  pre¬ 
vious  studies.  Three  indices  were  selected  empirically  for 
this study. The first is the size of the suspected region multiplied
by its digital value contrast. This index could be interpreted
to represent the “volume” of a suspicious mass.
Studies have indicated that suspicious mass regions with
large size and high contrast are easier to identify using CAD
schemes than small regions with lower contrast.25,34 The second
index is region conspicuity. This index has been extensively
investigated for the detection of lung nodules on chest
images.35 Radiologists typically achieved better diagnostic
performance in detecting lung nodules with higher conspicuity
than those with lower conspicuity.36 A similar relationship
between CAD performance and conspicuity of mass regions
has also been demonstrated.37 The third index is the region
circularity, an important feature in classifying suspicious
mass regions in a variety of CAD schemes.24,38

Table I. The number of false-positive regions in the training data set segmented
by each of the indices into the “easy,” “moderately difficult,” and
“difficult” groups, respectively.

Segmentation index    “Easy”    “Moderately difficult”    “Difficult”
Size × contrast       454       1002                      2334
Conspicuity           227       741                       2822
Circularity           366       849                       2575

Using  each  of  these  indices,  we  divided  suspicious  re¬ 
gions  into  three  groups,  which  were  defined  as  “easy,” 
“moderately  difficult,”  and  “difficult”  regions.  In  order  to 
have  the  same  number  of  true-positive  training  samples  in 
each  of  the  three  groups,  two  segmentation  thresholds  were 
determined  based  on  the  distribution  of  the  feature  values  for 
the true-positive regions. As a result, the “easy” group included
198 true-positive regions, and the other two groups
each had 197 true-positive regions. The number of false-positive
regions  that  resulted  from  such  segmentation  is  listed  in 
Table  I.  The  same  thresholds  were  applied  later  to  the  testing 
database. 
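
The threshold selection described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and it assumes that a higher index value corresponds to an "easier" region and that index values are distinct (ties would make the group sizes unequal).

```python
def tertile_thresholds(tp_values):
    """Return two cut points that split the true-positive index values
    into three groups of (nearly) equal size, highest values first."""
    ordered = sorted(tp_values, reverse=True)
    n = len(ordered)
    # Distribute any remainder to the first group(s): 592 -> 198, 197, 197.
    sizes = [n // 3 + (1 if i < n % 3 else 0) for i in range(3)]
    t_easy = ordered[sizes[0] - 1]
    t_moderate = ordered[sizes[0] + sizes[1] - 1]
    return t_easy, t_moderate


def assign_group(value, t_easy, t_moderate):
    """Apply the two thresholds to any region (true- or false-positive)."""
    if value >= t_easy:
        return "easy"
    if value >= t_moderate:
        return "moderately difficult"
    return "difficult"
```

Because the thresholds are derived from the true-positive regions only, the false-positive regions fall into the three groups in unequal numbers, as Table I shows.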

C.  GA  optimization 

In  each  group,  a  different  classifier  was  used  on  the  cases 
with  similar  characteristics.  To  search  for  an  optimal  set  of 
features  to  apply  to  each  group,  a  genetic  algorithm  (GA) 
was  used.  The  binary  coding  method  was  applied  to  create  a 
chromosome  used  in  the  GA.  Each  extracted  feature  corre¬ 
sponded  to  a  gene.  To  decide  the  number  of  hidden  neurons 
in  the  second  (hidden)  layer  of  the  ANN,  we  added  four 
genes  in  the  chromosome.  The  chromosome  had  a  fixed 
length  of  40,  where  the  first  36  genes  represent  extracted 
image  features,  and  the  last  4  genes  indicate  the  number  of 
hidden  neurons.  The  same  GA  software  and  initial  setup  pa¬ 
rameters  have  been  reported  previously.22  In  brief,  the  initial 
population  size  of  chromosomes  was  set  at  100.  The  cross¬ 
over  rate,  the  mutation  rate,  and  the  generation  gap  were  set 
at  0.6,  0.001,  and  1.0,  respectively. 
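
A minimal sketch of how such a 40-gene binary chromosome might be decoded is given below. The names are hypothetical, and the mapping from the four trailing genes to a neuron count (binary value plus one) is an assumption; the text does not state the actual mapping.

```python
N_FEATURE_GENES = 36   # one gene per extracted image feature (from the text)
N_HIDDEN_GENES = 4     # genes encoding the number of hidden neurons


def decode_chromosome(bits):
    """Split a 40-gene chromosome into selected feature indices and a
    hidden-neuron count."""
    if len(bits) != N_FEATURE_GENES + N_HIDDEN_GENES:
        raise ValueError("expected a chromosome of 40 binary genes")
    selected = [i for i, b in enumerate(bits[:N_FEATURE_GENES]) if b == 1]
    code = int("".join(str(b) for b in bits[N_FEATURE_GENES:]), 2)
    # Assumed offset so that the network always has at least one hidden neuron.
    return selected, code + 1
```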

A training sample with equal numbers of true-positive and
false-positive regions was then used to train the weights connecting
the neurons in the ANN. To minimize over-fitting
and keep the ANN performance robust when applied
to new cases, a limited number of training iterations as well
as a large ratio between the momentum and learning rate was
adopted.24,39  The  number  of  training  iterations  of  the  ANN 
was fixed at 1000, while the momentum and learning rate in
the ANN training were set at 0.8 and 0.01, respectively.
ROC  curves  generated  from  the  training  samples  (Az  values 
computed  by  the  program  rocfit)40  were  used  as  a  fitness 
function  (or  criterion)  in  the  GA  optimization.  The  chromo¬ 
somes  that  produced  higher  Az  values  had  higher  probabili¬ 
ties  of  being  selected  in  generating  new  chromosomes  for  the 
next generation using the methods of crossover and mutation.
The GA was terminated when it converged to the highest
Az value or reached a predetermined number of generations
(i.e., 100). The resulting set of features was assumed to
be  “optimal”  and  was  implemented  in  the  CAD  scheme. 
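
One generation of such a GA can be sketched as below. This is an illustrative sketch only: fitness-proportional (roulette-wheel) selection, single-point crossover, and bit-flip mutation are common choices assumed here, and a generation gap of 1.0 is interpreted as replacing the whole population. The `fitness` callable stands in for the Az value obtained after training an ANN with the encoded configuration.

```python
import random


def next_generation(population, fitness, rng,
                    crossover_rate=0.6, mutation_rate=0.001):
    """Produce a new population of the same size from the current one."""
    scores = [fitness(c) for c in population]   # must be positive
    children = []
    while len(children) < len(population):
        # Chromosomes with higher fitness are more likely to be parents.
        mother, father = rng.choices(population, weights=scores, k=2)
        a, b = list(mother), list(father)
        if rng.random() < crossover_rate:
            cut = rng.randrange(1, len(a))       # single-point crossover
            a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
        for child in (a, b):
            # Flip each gene independently with the mutation probability.
            children.append([1 - g if rng.random() < mutation_rate else g
                             for g in child])
    return children[:len(population)]
```

In use, this step would be repeated until the best fitness stops improving or 100 generations have elapsed, as described in the text.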

D.  Adaptive  and  nonadaptive  optimization 

In this study we compared the detection accuracy of
the ANNs when optimized adaptively
versus nonadaptively. In the adaptive optimization
method,  the  training  database  was  first  segmented  into  three 
subsets  with  a  “similar”  characteristic.  ANNs  with  different 
topologies  and  input  features  were  then  optimized  separately 
using  the  GA  method  for  each  subset.  To  train  an  ANN,  all 
true-positive  regions  in  the  subset  were  used,  and  the  same 
number  of  false-positive  regions  was  also  randomly  selected 
from  the  larger  dataset  of  false-positive  regions  in  that  group. 
Using  the  GA  method  an  ANN  was  optimized  specifically  for 
this subset. Since three segmentation indices (size × contrast,
conspicuity, and circularity) were used in this experiment, a
total of nine subsets, and hence nine ANNs, were established (three
subsets for each of the three segmentation indices).

In  the  nonadaptive  optimization,  the  cases  were  not  seg¬ 
mented  into  subsets.  Because  the  number  of  training  samples 
could  affect  performance,24  we  used  the  GA  method  to  opti¬ 
mize  the  ANN  once  with  198  randomly  selected  true-positive 
and  198  false-positive  regions  (ANN-1),  then  we  repeated 
the  procedure  including  all  592  true-positive  regions  in  the 
training  database  and  a  randomly  selected  set  of  592  false¬ 
positive  regions  (ANN-2). 
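
The balanced sampling used for both the adaptive and nonadaptive training sets can be sketched as follows. The function name and the (region, label) representation are hypothetical.

```python
import random


def balanced_training_set(tp_regions, fp_regions, rng):
    """Keep every true-positive region and draw the same number of
    false-positive regions at random, without replacement."""
    if len(fp_regions) < len(tp_regions):
        raise ValueError("not enough false-positive regions to balance")
    sampled_fp = rng.sample(fp_regions, len(tp_regions))
    return [(r, 1) for r in tp_regions] + [(r, 0) for r in sampled_fp]
```

For example, the “easy” size × contrast group would contribute its 198 true-positive regions plus 198 of its 454 false-positive regions.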

After  optimization,  an  independent  database,  which  in¬ 
cludes  358  masses  and  3265  regions  that  had  been  identified 
as  suspicious,  but  were  actually  negative,  was  used  to  evalu¬ 
ate  and  compare  the  performance  of  the  adaptive  and  non¬ 
adaptive  ANNs.  To  test  the  adaptive  scheme,  the  program 
first  segmented  the  database  into  subsets  using  the  same  in¬ 
dices  developed  for  the  training  phase.  The  ANN  results  for 
all  regions  in  the  testing  database  were  used  to  compute  the 
area  under  ROC  curves  (Az  values)  using  the  ROCFIT  pro¬ 
gram. 
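
For readers without access to ROCFIT, the area under the ROC curve can be approximated nonparametrically. The sketch below uses the Mann-Whitney statistic as a stand-in; it is not the binormal maximum-likelihood fit that ROCFIT performs, so the values it returns may differ somewhat from the Az values reported here.

```python
def empirical_auc(pos_scores, neg_scores):
    """Probability that a randomly chosen positive region scores higher
    than a randomly chosen negative region (ties count one half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```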

E.  Performance  gain  by  averaging  scores 

Averaging ratings of cases from different independent readings
could improve the diagnostic accuracy.41 Accuracy
gains are strongly dependent on the number of observations
(or schemes) and the correlation between observations. For
example, by averaging the results from three observations,
accuracy gains could range from 0 to 73.2% when the correlations
range from 1 to 0 (Ref. 41).

Table II. Correlation coefficients between cases assigned to different
groups using the segmentation rules based on the three features (size
× contrast, conspicuity, and circularity).

Indices            TP regions in        FP regions in        TP regions in       FP regions in
compared           training database    training database    testing database    testing database
ANN-1 to ANN-2     0.148                0.209
ANN-1 to ANN-3     -0.004
ANN-2 to ANN-3     0.005

Table III. The number of true- and false-positive regions assigned to the
different groups using the three segmentation indices when applied to the
testing database.

Segmentation       Group 1                 Group 2                 Group 3
index              true/false positives    true/false positives    true/false positives
Size × contrast    120/514                 123/893                 115/1890
Conspicuity        113/182                 116/612                 129/2503
Circularity        106/290                 107/791                 145/2216
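
The endpoints quoted in the text for three observations (0 and 73.2%) are consistent with a simple equal-variance model in which the separation between positive and negative scores improves by a factor of sqrt(n / (1 + (n - 1)ρ)) when n scores with common pairwise correlation ρ are averaged. The sketch below evaluates that model; it is offered as an illustration under this assumption and is not necessarily the estimator derived in Ref. 41.

```python
import math


def averaging_gain(n_obs, rho):
    """Fractional gain in separation from averaging n_obs equally
    correlated, equal-variance scores (assumed simplified model)."""
    return math.sqrt(n_obs / (1.0 + (n_obs - 1) * rho)) - 1.0
```

With three observations the model gives no gain at ρ = 1 and a gain of sqrt(3) - 1 at ρ = 0.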

Similar  to  the  multireader  problem,  we  segmented  the 
data  set  three  times  using  each  of  the  three  segmentation 
features  (size Xcontrast,  conspicuity,  and  circularity).  Each 
segmentation  resulted  in  three  subsets  of  cases.  Note  that  a 
case  segmented  into  group  one  (“easy”)  based  on  one  fea¬ 
ture  (e.g.,  circularity)  may  be  classified  into  group  three 
(“difficult”)  based  on  another  feature  (e.g.,  conspicuity). 
Each  suspicious  region  was  assigned  to  a  specific  category 
using  each  segmentation  index,  and  the  “optimal”  ANN  for 
that  subset  was  applied  by  assigning  a  likelihood  score. 
Hence,  each  region  was  assigned  three  different  scores  re¬ 
lated  to  its  likelihood  for  depicting  a  true  mass.  These  scores 
were  averaged  and  a  “combined”  ROC  curve  was  generated. 
Results  were  compared  to  those  obtained  using  individual 
scores.  In  addition,  we  compared  experimentally  measured 
and  expected  gains  due  to  averaging  based  on  measured  cor¬ 
relations 

$$\rho_{X,Y} = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y},$$

where cov(X, Y) is the covariance of two vectors X and Y,
and $\sigma_X$ and $\sigma_Y$ are the standard deviations of the vectors,
respectively.42 The theoretical expected gains were computed
for the averaging of multiple observations.41
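
The two computations described in this section, the correlation coefficient between two score vectors and the averaging of the scores that each region receives from the different adaptive schemes, can be sketched as follows. The function names are hypothetical.

```python
import math


def pearson(x, y):
    """Correlation coefficient: covariance divided by the product of
    the standard deviations of the two vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (std_x * std_y)


def average_scores(*score_lists):
    """Average the likelihood scores region by region; each list holds
    one scheme's scores in the same region order."""
    return [sum(scores) / len(scores) for scores in zip(*score_lists)]
```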

III.  RESULTS 

Table I summarizes the number of false-positive regions
assigned to each group when different features were used for
segmentation in the training data set. Note the large number
of regions assigned to the last “difficult” group. In general,
this indicates that many of the false-positive regions
were not “easy” to rule out as a true mass. The correlation
coefficients  between  the  classification  assignment  of  regions 
based  on  the  segmentation  performed  using  the  three  features 
are  summarized  in  Table  II.  The  low  correlations  indicate 
that  a  large  number  of  regions  in  each  database  were  seg¬ 
mented  into  different  groups  when  different  features  were 
used  for  segmentation.  Only  12.5%  of  the  true-positive  re¬ 
gions  and  25.2%  of  the  false-positive  regions  in  the  training 
database  were  consistently  assigned  to  the  same  group  (e.g., 
easy).  As  a  result,  for  the  same  training  database,  three  sets 
of  adaptive  ANNs  were  actually  trained  with  different  cases 
for each group. When ANN scores from randomly selected
groups with the same number of cases are compared, the
correlation  coefficients  range  from  0.712  to  0.963.  These  re¬ 
sults  clearly  demonstrate  that  additional  information  could  be 
obtained  from  the  adaptive  approach. 

Table III provides the distribution of regions segmented
into the different groups using the three segmentation indices
in the testing database. While the percentage of large
size × contrast regions (“easy” regions) is somewhat higher
than that assigned to this group in the training database, the
general distributions are quite similar. The optimization process
resulted in ANNs that included different input features
and varying numbers of hidden neurons. The number of input
features ranged from 9 to 15 and the number of hidden
neurons ranged from 3 to 7. Table IV provides the results
(Az) for the different schemes when applied to the testing
database and a comparison (P values) to the nonadaptive
scheme using 198 positive and 198 negative regions for
training (ANN-1). The approach in ANN-2 is similar to
ANN-1, except that 592 positive and 592 negative regions were
used for training purposes. Both ANN-1 and ANN-2 are nonadaptive
schemes, and the significant improvement (P = 0.03)
in ANN-2 is largely the result of more complete
feature domain coverage. Adaptive schemes 1-3 are the results
after optimization by segmentation based on individual
indices. For example, scheme 1 was trained using the subsets
defined by size × contrast as a segmentation index. As can be seen, the
results are somewhat better (albeit not significantly) than the
nonadaptive scheme using 198 positive and 198 negative regions
(ANN-1), but these are not improved compared with
ANN-2. On the other hand, by averaging detection scores of
the different adaptive schemes (either two or all three), significant
gains in detection accuracy (P<0.01) are achieved.
Averaging results from two or three adaptive schemes resulted
in a much larger performance gain (P<0.01) in the
testing database as compared with ANN-2. Figures 1 and 2
demonstrate the ROC curves for several different classification
schemes.

Table IV. Areas under ROC curves (Az values) for different schemes and
their comparisons (two-tailed P values) with the nonadaptive scheme using
198 positive and 198 negative regions (ANN-1).

Scheme               Az (a)    P
Nonadaptive ANN-1    0.82
Nonadaptive ANN-2    0.85      0.03
Adaptive-1           0.84      0.18
Adaptive-2           0.83      0.63
Adaptive-3           0.84      0.21
Average (1+2)        0.91      <0.01
Average (1+3)        0.92      <0.01
Average (2+3)        0.91      <0.01
Average (1+2+3)      0.95      <0.01

(a) Standard deviation for all Az values is 0.01.

[Figure: ROC plot; horizontal axis: fraction of false-positive detection]

Fig. 1. ROC curves from nonadaptive ANN-1 and three
sets of noncombined adaptive ANNs. The Az values for
these curves are 0.82, 0.84, 0.83, and 0.84, respectively.

To  verify  the  theoretical  feasibility  of  obtaining  the  per¬ 
formance  gains  observed  in  this  study,  we  used  the  correla¬ 
tions  for  the  test  results  from  the  different  adaptive  schemes 
(Table  V)  in  the  estimation  method  proposed  by  Swensson 
et al.41 to compute expected improvements by averaging
these  schemes.  Table  VI  summarizes  the  predicted  Z  values 
and  percentage  gain  in  accuracy  by  averaging  scores  of  two 
or  three  adaptive  schemes.  Predicted  Az  values  using  a  gen¬ 
eral  binormal  model  are  also  provided.  These  are  consistent 
with  the  experimental  results  we  computed  directly  using 
ROCFIT. 
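
The predicted Az values in Table VI appear to follow from the predicted Z values through the standard normal cumulative distribution, Az = Φ(Z). This relation is inferred here from the tabulated numbers rather than stated in the text, so the sketch below should be read as a consistency check under that assumption.

```python
import math


def az_from_z(z):
    """Standard normal cumulative distribution evaluated at z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Predicted Z (average) values as listed in Table VI.
PREDICTED_Z = {"1+2": 1.374, "1+3": 1.420, "2+3": 1.338, "1+2+3": 1.644}

predicted_az = {name: round(az_from_z(z), 2)
                for name, z in PREDICTED_Z.items()}
```

Rounded to two decimals, the computed values agree with the "Predicted Az" column of Table VI.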


IV.  DISCUSSION 

Averaging  diagnostic  ratings  from  different  readers41  or 
scores  from  different  machine  learning  classifiers17,21  might 
significantly  improve  detection  accuracy,  if  the  ratings  or 
scores  from  different  observations  have  low  correlations. 
ANN  is  one  of  the  most  commonly  used  machine  learning 
classifiers in CAD developments, due to its ability to learn
complex  patterns  directly  from  training  samples  with  mini¬ 
mal  requirement  on  prior  knowledge  of  the  input  features  or 
internal  system  operation.43  In  this  study,  we  explored  a 
simple  and  novel  method  to  segment  and  optimally  train  sets 
of  adaptive  ANNs.  Since  these  produced  extremely  low  cor¬ 
related  classification  results  using  a  large  and  independent 
testing  database,  significant  gains  were  realized  by  averaging 
the  scores  from  the  different  ANNs. 

Fig. 2. ROC curves of classification results from nonadaptive
schemes (ANN-1 and ANN-2) as well as after
averaging scores of three sets of adaptive ANNs. The
Az values are 0.82±0.01, 0.85±0.01, and 0.95±0.01,
respectively.

Table V. Correlation coefficients between testing results using adaptive
ANN scores from different schemes.

Between adaptive schemes    TP regions [ρ(u)]    FP regions [ρ(n)]    Between Az
ANN-1 to ANN-2              0.018                -0.004               0.013
ANN-1 to ANN-3              -0.011               0.003                -0.007
ANN-2 to ANN-3              0.116                0.011                0.086

Given the large number of independent variables that are
needed to characterize masses and normal tissue structure on
digitized  mammograms  and  the  fact  that  many  of  the  features 
are  continuous  and  span  a  wide  range  of  values,  a  large  and 
carefully  selected  training  data  set  is  required  to  ensure  ad¬ 
equate  domain  coverage  that  could  result  in  robust 
performance.24  Finding  an  optimal  feature  set  from  a  limited 
image  database  is  an  important  factor  in  determining  the  per¬ 
formance  and  robustness  of  CAD  schemes.44,45  Had  it  been 
possible  to  extract  an  “ideal”  (or  fully  optimized)  set  of  fea¬ 
tures  that  adequately  covers  the  variables’  domain  from  a 
limited data set, it might not have been necessary to perform the adaptive
filtering and score averaging procedures described here.
Using  different  training  samples  to  optimize  ANNs  could 
result  in  different  topologies  (similar  to  using  different  input 
features  or  having  different  numbers  of  hidden  neurons). 
However,  our  experiments  showed  that  generally  the  corre¬ 
lations  of  the  detection  results  when  applying  these  ANNs  to 
an independent testing database were quite high (ρ ≥ 0.7).

In  order  to  take  advantage  of  possible  improvement  in 
performance  due  to  score  averaging,  one  should  train  differ¬ 
ent  ANNs  using  the  samples  with  different  characteristics. 
The  adaptive  concept  reported  in  previous  CAD  studies26,27 
was  used  here  to  group  images  with  similar  characteristics. 
The  three  segmentation  indices  reported  in  this  study  re¬ 
sulted  in  87%  of  true-positive  and  74%  of  false-positive  re¬ 
gions  being  classified  in  different  groups.  Hence,  the  ANNs 
for the “same” group (e.g., “easy” group) were trained using
different  images  in  each  of  the  subsets  segmented  based  on 
values  from  one  of  the  three  features.  As  a  result,  the  classi¬ 
fication  scores  generated  by  these  three  ANNs  had  low  cor¬ 
relations.  Similar  to  averaging  ratings  from  independent 
observers,28,29,41  averaging  the  scores  from  these  “indepen¬ 
dent”  ANNs  yielded  significant  performance  gains. 

Although quite encouraging, the results presented here are
preliminary and have to be validated in larger independent
databases. We explored here only three simple and commonly
used features for segmentation purposes. Other features,
including those extracted locally (from a suspicious
region) and globally (from a full image), should be explored
as well. However, based on the results of this preliminary
experiment, we believe that the approach taken may have
significant advantages over a multifeature, single ANN approach
to the problem.

Table VI. The predicted performance gain of averaging scores from the
three adaptive schemes using the methodology proposed by Swensson et al.
(Ref. 41).

Averaging           Predicted      Percentage gain
adaptive schemes    Z (average)    in Z value         Predicted Az    Measured Az
1+2                 1.374          48.2               0.92            0.91 ± 0.01
1+3                 1.420          53.1               0.92            0.92 ± 0.01
2+3                 1.338          44.3               0.91            0.91 ± 0.01
1+2+3               1.644          77.3               0.95            0.95 ± 0.01

ABSTRACT

As  the  number  of  mammographic  examinations  increases,  it  becomes  clear  that  in  many  underserved  locations,  there  is  a 
lack  of  expertise  that  is  required  for  consistent,  highly  accurate,  and  timely  diagnosis.  Hence,  mammograms  are 
frequently  sent  to  other  medical  facilities,  and  a  significant  fraction  of  women  (typically  3-10%)  are  recalled  for 
additional  examinations.  It  is  the  purpose  of  this  project  to  develop,  test,  and  clinically  evaluate  a  telemammography 
system  that  will  operate  between  several  remote  locations  and  a  large  breast  cancer  center.  In  this  manuscript  we 
describe  the  design  considerations,  implementation,  and  initial  testing  that  were  undertaken,  to  date.  The  system 
digitizes a mammogram at 50 µm pixel size, compresses the resulting image file (~75:1), and transmits it over a
telephone  line  to  the  central  site  where  the  data  received  are  decompressed  and  displayed  on  a  high-resolution 
workstation  in  approximately  4  minutes  per  image.  Initial  testing  of  the  system  indicates  that  a  relatively  inexpensive 
system  for  “almost  real-time”  telemammography  can  be  employed  in  any  geographic  area  that  possesses  standard 
telephone  lines,  and  this  approach  to  enhance  communication  may  make  it  possible  to  offer  better  mammographic 
services  at  remote  locations. 

Key  Words:  Imaging,  Teleradiology,  Mammography,  Data  compression,  Image  display 


1.  INTRODUCTION 

Periodic  mass  screening  of  asymptomatic  women  is  rapidly  gaining  approval  and  acceptance,  and  the  population 
segment  recommended  for  screening  is  increasing  due  to  both  longer  life  expectancy  as  well  as  earlier  recommended  age 
for  initial  examination  [1-3].  The  large  variability  in  a  number  of  important  aspects  related  to  mammography,  as 
practiced  in  the  U.S.,  resulted  in  the  enactment  of  the  Mammography  Quality  Standards  Act,  which  mandates 
accreditation  of  each  program  (facility,  technical  and  professional)  [4,5].  Shortages  of  expert  mammographers  in  many 
locations,  combined  with  the  desire  to  make  it  convenient  for  the  patient  to  undergo  the  procedure,  suggest  that  there 
may  be  a  need  for  high-quality  telemammography  systems  that  enable  a  distributed  acquisition-centralized  expert  review 
type  solution  to  the  problem  [6,7].  The  relatively  high  recall  rates  (5-15%)  of  screened  women  to  supplement 
information  that  was  not  ascertained  during  the  initial  visit  (e.g.  magnification  views)  also  make  it  desirable  to  enable 
physician  “monitoring”  and  “management”  of  remote  locations  so  that  clinical  and  diagnostic  decisions  can  be  made 
while  the  patient  remains  in  the  clinic  [8-11].  Early  attempts  to  develop  and  implement  a  practical  telemammography 
solution  to  this  problem  failed  due  to  several  significant  technical  problems  associated  with  acquisition,  transmission, 
management, and display of the images [12-14]. Many of these technical issues have been resolved in recent years, but
some  remain  [14-18].  Although  an  adequate  communication  infrastructure  for  high-quality  telemammography  is 
available  within  some  urban  regions,  the  fact  remains  that  where  it  may  be  needed  most  (i.e.  remote,  non-urban 
locations),  enabling  (two-way)  communication  systems  are  limited  mainly  to  the  Plain  Old  Telephone  System  (POTS). 
Other  communication  technologies,  such  as  satellites,  are  being  evaluated  for  this  purpose,  but  it  is  not  likely  that  these 
will displace POTS in most underserved areas for quite some time [19-21]. Hence, the problem of cost-effective, timely
remote  patient  monitoring  and  management  in  many  underserved  areas  is  not  a  simple  one.  Using  a  unique  data-handling 
scheme,  we  have  been  able  to  demonstrate  that  high-quality,  multi-site  telemammography  systems  can  be  developed 
under  these  acquisition  and  communication  constraints  [22,23].  Using  similar  concepts,  we  have  been  developing  a 
multi-site  system  that  enables  “almost  real-time  communications”  between  the  “spokes  and  the  hub.”  Design 
considerations  as  well  as  implementations  and  initial  testing  procedures  are  described  in  the  manuscript. 

2.  METHOD 

At  the  remote  sites,  we  use  a  high  resolution  Lumiscan  85  film  digitizer  (Eastman  Kodak,  Rochester,  NY)  connected  via 
SCSI to a Windows NT 2000 PC (900 MHz Athlon, 512 MB) running multi-threaded software. The digitizer is equipped
with a film feeder and is capable of digitizing up to six films in a batch at 50 µm pixel size over optical densities ranging
from  0  to  4.0  OD.  Four  slots  of  the  film  feeder  are  labeled  for  specific  mammographic  views  (i.e.  LCC,  RCC,  RMLO 
and  LMLO)  for  ease  of  use  during  the  digitization  process.  The  user  at  the  remote  site  (typically  a  technologist)  selects 
either  an  option  to  digitize  a  “standard”  protocol  for  an  image  set  or  any  of  the  six  films  he/she  chooses  to  send,  by 
clicking  on  an  appropriate  icon. 

The  user  enters  patient  information  into  a  computer  data  entry  form  during  the  digitization.  At  this  time  he/she  also 
enters  information  for  ‘non-standard’  cases  by  choosing  from  drop-down  menus  the  anatomy  and  view  for  each  of  the 
films  being  digitized.  Meanwhile  the  software  on  the  PC  establishes  a  connection  with  the  central  hub  if  a  connection 
does  not  already  exist.  This  is  currently  done  via  dial-up  phone  line  or  an  Internet  connection,  but  optionally  ISDN  or 
DSL  can  be  used  as  well.  For  the  dial-up  connection,  internal  56K  hardware  modems  (U.S.  Robotics,  Rolling  Meadows, 
IL)  are  used.  The  image  data  are  processed  in  sections,  segmented,  and  compressed  using  JPEG  2000  compatible 
irreversible  wavelet  compression  and  transmitted  in  packets  to  the  central  site.  Optionally,  a  report  or  patient  history  can 
be  transmitted  along  with  the  images  by  inserting  them  into  an  attached  page  scanner  (OneTouch  8650,  Visioneer,  Inc., 
Fremont,  CA). 
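
To give a feel for the data volumes involved, the following back-of-the-envelope sketch estimates the line time for one image. Only the 50 µm pixel size and the approximately 75:1 compression ratio come from the text; the 18 × 24 cm film size, the two bytes stored per pixel, and the 33.6 kbit/s effective line rate are assumptions chosen for illustration.

```python
def transmit_estimate(width_cm, height_cm, pixel_um=50, bytes_per_pixel=2,
                      compression_ratio=75, link_bits_per_s=33_600):
    """Return (raw bytes, compressed bytes, seconds on the line)."""
    cols = round(width_cm * 10_000 / pixel_um)    # 1 cm = 10,000 um
    rows = round(height_cm * 10_000 / pixel_um)
    raw_bytes = cols * rows * bytes_per_pixel
    compressed_bytes = raw_bytes / compression_ratio
    seconds = compressed_bytes * 8 / link_bits_per_s
    return raw_bytes, compressed_bytes, seconds


raw, compressed, seconds = transmit_estimate(18, 24)
```

Under these assumptions an image occupies about 34.6 MB before compression and about 0.46 MB after it. The estimate covers line time only; digitization, compression, and protocol overhead are not modeled.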

The  central  site  has  a  Windows  2000  Server  workstation  (Dual  1.2  GHz  Athlon  MP,  2  GB  RAM)  running  specially 
developed software. Data received from remote sites are reconstructed from the packets, decompressed, and stored on a
hard  disk  and/or  in  memory  (if  available).  Several  cases  (depending  on  size)  can  be  stored  in  memory  for  instant  access. 
Cases stored on disk take a few seconds to restore to memory. The display consists of a pair of high-resolution (2048 × 2560)
8-bit grayscale portrait monitors at a nominal setting of 80 ftL (DS5100P, Clinton Electronics, Rockford, IL). The
bottom  of  the  displays  holds  a  bar  of  icons  and  arrows  for  selecting  cases,  images,  and  other  tools.  The  user  can  select 
from  a  patient  list  that  displays  the  unreviewed  cases  on  the  top  (similar  to  a  “worklist”).  When  a  case  is  selected,  four 
images  appear  in  quadrants  on  the  right  monitor.  The  left  monitor  displays  the  currently  selected  image  (the  first  image 
by  default)  at  half  the  available  resolution.  Although  images  are  displayed  at  window  and  level  settings  determined  by 
the  statistics  of  the  signal  from  individual  image  data  sets,  the  user  may  select  the  window  and  level  tool  and  alter  it  in 
real time using a mouse. A “magnify” tool is also available that magnifies any square region under the cursor in real
time to full resolution as it is moved over the image. Among other tools on the tool bar are arrows that allow movement
to  the  next  or  preceding  case. 
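
A window and level adjustment of the kind described above is commonly implemented as a lookup table that maps stored pixel values to display values. The sketch below assumes a linear ramp and 12-bit input data, neither of which is specified in the text; the 8-bit output matches the monitors described.

```python
def window_level_lut(window, level, in_max=4095, out_max=255):
    """Build a lookup table mapping stored pixel values 0..in_max to
    display values 0..out_max with a linear ramp of the given width
    (window, must be positive) centered on level."""
    low = level - window / 2.0
    lut = []
    for value in range(in_max + 1):
        fraction = (value - low) / window
        fraction = min(1.0, max(0.0, fraction))   # clip outside the window
        lut.append(round(fraction * out_max))
    return lut
```

Because only the table has to be recomputed when the user drags the mouse, the displayed image can be updated interactively.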

We  plan  to  add  DICOM  compatibility  to  the  workstation  at  the  central  site.  This  will  include  the  capability  to  send  and 
print  selected  images  to  a  mammographic  film  printer  (DryView  8610,  Eastman  Kodak,  Rochester,  NY).  This  will  also 
allow  transferring  workstation  images  to  another  DICOM  device  (workstation  or  storage)  and  also  allow  access  to 
images  from  other  DICOM  compatible  devices,  such  as  full  field  digital  mammography  acquisition  systems  [24,25]. 

We  also  plan  to  add  computer-aided  detection  (CAD)  software  at  the  remote  site.  This  would  allow  image  analysis  to  be 
performed  on  the  original  images  during  the  time  the  compressed  data  sets  are  transmitted.  The  results  can  be  sent 
immediately  after  the  image  data  transfer,  simply  as  coordinate  data.  Suspicious  areas  for  masses  and  microcalcifications 
would  then  be  marked  on  a  removable  overlay  on  the  images  at  the  hub.  Figure  1  is  a  schematic  diagram  of  the  system  as 
it  is  currently  configured  and  being  evaluated. 
Figure 1: Schematic of the Telemammography System


3.  SOFTWARE  DESIGN 


Both the hub site and remote site computer programs are designed using multithreading to permit each task to be
completed in a timely manner while keeping the system responsive to user input. The main threads communicate with
one another by sending thread messages. Each main thread handles the messages that are applicable to it
and ignores any others. A main thread may spawn another thread to accomplish some subordinate task. These spawned
threads do not receive messages, but they do send messages.

The  main  threads  for  the  hub  site  program  are: 

The Archive Manager handles saving and loading of images and cases and the deletion of uncompressed images when
free disk space becomes low.

The  Case  Manager  handles  the  functions  of  creating  images  and  cases,  in  addition  to  most  of  the  database  functions. 

The  Display  Manager  controls  the  display  of  images  and  forwards  messages  to  the  main  application  window. 

The  Distribution  Manager  handles  the  receipt  and  transmission  of  data  and  the  processing  (including  decompression)  of 
the  data. 

The  main  threads  for  a  remote  site  are: 

The Digitizer Manager handles all the tasks related to the film digitization.

The  Case  Manager  handles  the  functions  of  creating  images  and  cases,  in  addition  to  most  of  the  database  functions. 

The  Display  Manager  controls  the  display  of  images  and  forwards  messages  to  the  main  application  window. 

The  Distribution  Manager  handles  the  transmission  and  receipt  of  data  and  the  processing  (including  compression)  of  the 
data. 

For the most part, the threads are synchronized using a Reader/Writer lock built from a combination of the built-in
Microsoft Windows synchronization primitives. This lock allows either any number of readers or just one writer to have
access to a shared object. This allows greater concurrency than could be achieved with a mutex, which
allows only a single thread to access an object at a time, forcing all other threads to wait.
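
The semantics of such a lock can be sketched as follows. This is an illustrative Python version built on a condition variable, not the Windows-primitive implementation used in the system; the class and method names are assumptions, and this simple form does not protect writers from starvation when readers arrive continuously.

```python
import threading


class ReadWriteLock:
    """Allows any number of concurrent readers or exactly one writer."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0      # number of readers currently holding the lock
        self._writer = False   # True while a writer holds the lock

    def acquire_read(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()  # a waiting writer may now proceed

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers > 0:
                self._cond.wait()
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()  # wake waiting readers and writers
```

With a plain mutex, two threads that only read a shared case list would still serialize; with the lock above they proceed together and only a writer forces exclusive access.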

4.  USER  FUNCTIONALITY 

At  the  remote  sites  all  data  entry  functions  utilize  pull-down  menus  supported  by  the  use  of  a  keyboard.  A  “start” 
command  enables  digitization  of  a  case,  and  data  entry  can  be  performed  within  a  predetermined  time  slot  during  the 
digitization  process.  At  the  central  site,  a  high-resolution  workstation  is  operated  solely  using  a  mouse,  and  several 
simple  options  are  available  by  clicking  on  the  appropriate  button  (e.g.  flip,  magnify,  rotate,  display  on  other  monitor, 
etc.).  The  cases  in  memory  and  those  on  disk  are  so  indicated  on  patient  lists,  and  automatic  lookup  tables  (image- 
statistic  based)  are  used  to  display  “reasonable”  default  settings. 

5.  RESULTS 

The system has been designed, assembled, and tested for technical reliability. Currently the three sites (see Figure 1) are
located 15 to 90 miles from our hub in Pittsburgh. The remote sites are all outpatient clinics, which
are staffed by a physician from one day a week to half a day every two weeks. Cases from multiple sites have been
transmitted simultaneously and received successfully at the hub. Average transmission times for a four-image case vary
significantly with bandwidth availability and film size and currently range from 9 to 25 minutes. We are currently
evaluating different approaches to reduce the cycle time to no more than 15 minutes per case. To date we
have received over 200 cases from the remote sites, and we are analyzing user functionality at all locations.

Two mammographers performed an initial evaluation of a series of cases and the workstation’s basic functionality.
The  quality  of  the  images  received  was  subjectively  judged  to  be  acceptable  or  better.  A  series  of  retrospective  analyses 
on  a  large  number  of  cases  sent  from  all  sites  will  follow. 
6. DISCUSSION


Low-cost telemammography is becoming feasible as communication technology and processing capabilities continue to
improve in terms of cost, availability, and reliability. The system we designed supports variable compression rates,
should they be desired, as well as printing of images at the receiving site. Equally important, the incorporation of a CAD
scheme into the protocol may aid in decision making at both the sending (remote) sites and the receiving site. It
should be noted that the system was not designed for electronic primary diagnosis, but rather to facilitate better
communication between remote (and perhaps underserved) sites and a central hub where expertise is more readily
available.

Our  initial  assessment  indicates  that  technically  our  objectives  can  be  met,  and  we  hope  that  our  planned  clinical 
evaluations  will  improve  our  understanding  as  to  whether  or  not  such  systems  can  be  used  to  enhance  communication, 
aid  in  timely  decision  making,  help  reduce  recall  rates,  and  ultimately  enhance  and  improve  the  timeliness  and  quality  of 
the  service  we  can  provide  in  locations  where  expert  mammographers  are  not  physically  present  at  the  time  of  the 
examination. 
APPENDIX 4


Original  Investigations 


Computer-aided  Detection  in  Mammography: 

An  Assessment  of  Performance  on  Current  and  Prior  Images1 


Bin  Zheng,  PhD,  Ratan  Shah,  MD,  Luisa  Wallace,  MD,  Christiane  Hakim,  MD,  Marie  A.  Ganott,  MD,  David  Gur,  ScD 


Rationale  and  Objectives.  The  authors  assessed  and  compared  the  performance  of  a  computer-aided  detection  (CAD) 
scheme  for  the  detection  of  masses  and  microcalcification  clusters  on  a  set  of  images  collected  from  two  consecutive 
(“current”  and  “prior”)  mammographic  examinations. 

Materials and Methods. A previously developed CAD scheme was used to assess two consecutive screening mammograms
from 200 cases in which the current mammogram showed a mass or cluster of microcalcifications that resulted in
breast biopsy. The latest prior examinations had been initially interpreted as negative or definitely benign findings (Breast
Imaging Reporting and Data System rating, 1 or 2). The study involved images of 400 examinations acquired in 200 patients.
Radiologists identified 172 masses and 128 clusters of microcalcifications on the current images. The performance
of the CAD scheme was analyzed and compared for the current and latest prior images.

Results. There were significant differences (P < .01) between current and prior images in many feature values. The performance
of the CAD scheme was significantly lower for prior than for current images (P < .01). At 0.5 and 0.2 false-positive
mass and cluster cues per image, the scheme detected 78 malignant masses (78%) and 63 malignant clusters
(80%) on current images. Only 42% of malignant cases were detected on prior images, including 40 masses (40%) and 36
microcalcification clusters (46%).

Conclusion.  CAD  schemes  can  detect  a  substantial  fraction  of  masses  and  microcalcification  clusters  depicted  on  prior 
images.  To  improve  performance  with  prior  images,  the  scheme  may  have  to  be  adaptively  reoptimized  with  increasingly 
more  subtle  abnormalities. 

Key  Words.  Breast,  calcification;  breast  neoplasms,  diagnosis;  breast  radiography;  computers,  diagnostic  aid. 

© AUR, 2002


Breast cancer is a common cancer in women over the age of 40 years (1). Early detection is believed to be
important for improved prognosis and therapy and for reducing associated mortality and morbidity (2).
Mammography is a well-established and accepted method for screening the general population. Current
guidelines in the United States recommend periodic mammographic screening for women aged 40 years or
older (3). Because of the large volumes, low expected detection rate of abnormalities in screening
examinations, and the complexity of tissue patterns depicted on a large fraction of images, it is both
difficult and time consuming to interpret mammographic cases (4). Independent double reading is a
well-documented method to improve early detection of breast cancer (5,6), but this approach is often not
practical due to personnel and logistic constraints (7).

After extensive investigations and development efforts for more than a decade, computer-aided detection
(CAD) systems have been accepted as clinical tools that provide radiologists with a useful “second
opinion.” Three CAD systems, ImageChecker (R2 Technology, Los Altos, Calif), Second Look (CADx Medical
Systems, Quebec, Canada), and MammoReader (Intelligent Systems Software, Clearwater, Fla), have been
approved to date by the U.S. Food and Drug Administration for this purpose. Their performance has been
evaluated (8-10). While in general the systems have been shown to increase sensitivity, these results are
not universal. One study reported that, with the help of a commercially available system, two
radiologists detected 19.5% more cancers with only a slight increase (from 6.5% to 7.7%) in recall rate
(11). Another study reported that use of a comparable system did not affect the performance of three
radiologists retrospectively interpreting a set of mammograms depicting 59 breast cancers in 280 patients
(no increase in sensitivity or decrease in specificity) (12). Our own preliminary study, in which seven
radiologists interpreted 120 mammographic cases under five different CAD cueing conditions, suggested
that highly performing CAD schemes can significantly improve the diagnostic performance of radiologists,
while poorly performing schemes can adversely affect performance (13).

One objective of using CAD is the potential to detect breast cancers at an earlier stage. It is well
known that a large number of breast abnormalities (ie, masses and microcalcification clusters) are
visible in retrospect on prior mammograms but are not interpreted at the time as highly suspicious. In
one study, 427 breast cancer cases were reviewed, and the abnormality in question was visible on the
latest prior mammograms in 286 (67%) (9). When 115 of the “more obvious” cases (27% of the original 427
cases) were processed by a CAD system, 89 cancers (or 77%) were identified as suspicious on the prior
mammograms, with an average of one false-positive cue per image (14). Commercial systems generally
provide only a binary outcome for each suspicious region (cued or not cued) based on a predetermined (and
undisclosed) threshold. Therefore, the difference in performance between different groups of images (in
this case “current” and “prior”) can be measured only at one operating point. Hence, complete
characterization (eg, a free-response receiver operating characteristic [FROC]-type curve) of the
performance cannot be estimated (8,14).

In the study reported here, we applied a CAD scheme previously developed in our laboratory to a set of
200 selected cases with mammograms from two consecutive examinations. At the latest examination (current
images), at least one suspicious mass or microcalcification cluster was identified by the interpreting
radiologist, resulting in breast biopsy. For the prior examinations, all images were interpreted as
“negative” or “benign finding.”


MATERIALS  AND  METHODS 


The mammographic cases used in this study were selected from biopsy records of two medical facilities in
Pittsburgh, Pa. In one facility we collected all available biopsy cases performed in 1997, and in another
we ascertained a fraction of the biopsy cases performed in 2000. First, we excluded cases for which all
the original mammograms from the latest prior examination were not available. Second, we excluded cases
in which the recommendations for biopsy had not been based on the finding of either a mass or a
microcalcification cluster. Third, we selected only cases whose findings on the latest prior examination
had been interpreted as either negative or benign (Breast Imaging Reporting and Data System rating,
1 or 2).

From the remaining pool, 200 cases were selected sequentially for the study. Each case included images
acquired from two consecutive examinations. In this set of 200 cases, the interval between the current
examination (when the patient was sent to biopsy) and the latest prior examination varied from 10 to 22
months. Radiologists identified 172 masses and 128 microcalcification clusters in this data set. Of the
172 identified masses, 164 were visible (in retrospect) on both views (craniocaudal [CC] and mediolateral
oblique [MLO]), and eight were visible only on one view. One hundred twenty of 128 microcalcification
clusters were visible on two views, and eight on only one. Hence, there were a total of 336 mass regions
and 248 cluster regions depicted on these mammograms. One hundred masses and 79 clusters were associated
with malignancies. Of these, two masses and four clusters were visible on only one view. Therefore, 198
mass regions and 154 cluster regions depicted on the current images were associated with malignancy.
Table 1 summarizes the distributions of abnormalities by type and abnormality in the database. A fraction
of the masses and clusters were visible on the prior images. Therefore, the corresponding locations of
all mass and cluster regions on prior images were determined visually during a side-by-side inspection
and after differences in breast positioning and compression were accounted for subjectively.
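
The region totals quoted above follow from counting each abnormality once per view on which it is visible. A short arithmetic check, with a helper function name chosen here purely for illustration:

```python
def region_count(total_abnormalities, visible_on_one_view_only):
    """Abnormalities seen on two views contribute two regions; the rest contribute one."""
    visible_on_two_views = total_abnormalities - visible_on_one_view_only
    return 2 * visible_on_two_views + visible_on_one_view_only


all_mass_regions = region_count(172, 8)           # 2 * 164 + 8
all_cluster_regions = region_count(128, 8)        # 2 * 120 + 8
malignant_mass_regions = region_count(100, 2)     # 2 * 98 + 2
malignant_cluster_regions = region_count(79, 4)   # 2 * 75 + 4
```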

All  mammograms  were  digitized  in  our  laboratory  with 
a  laser  film  digitizer  (Eastman  Kodak,  Rochester,  NY) 
CAD ON CURRENT AND PRIOR IMAGES

Table 1
Distribution of Selected Masses and Microcalcification Clusters

                                          All Cases                          Malignant Cases
Type of Abnormality          Total   Visible on   Visible on       Total   Visible on   Visible on
                                     2 Views      1 View                   2 Views      1 View
Mass only                    153     145          8                83      81           2
Cluster only                 109     101          8                62      58           4
Mass and clusters combined   19      19           0                17      17           0

with a pixel size of 50 × 50 µm and 12 bits of gray levels. Each image was then subsampled by a factor of
two in both dimensions with a pixel averaging method to reduce the spatial resolution to 100 × 100 µm.
Our previously described CAD scheme (15) was applied to the images to detect suspicious regions for
microcalcification clusters. Images were then subsampled again by a factor of four in both dimensions to
reduce the effective pixel size to 400 × 400 µm, and a “mass” detection scheme (16) was applied.
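
Subsampling by pixel averaging can be sketched as below. The exact averaging routine used in the study is not described in the text, so this is an assumed non-overlapping block mean, written in plain Python on a nested list for clarity; the function name is hypothetical.

```python
def subsample_by_averaging(image, factor):
    """Replace each non-overlapping factor x factor block by its mean value.

    `image` is a list of equal-length rows; trailing rows or columns that do
    not fill a complete block are dropped.
    """
    rows = len(image) // factor
    cols = len(image[0]) // factor
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            block = [
                image[r * factor + i][c * factor + j]
                for i in range(factor)
                for j in range(factor)
            ]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out


# Effective pixel size after each stage, in micrometers.
pixel_for_clusters = 50 * 2                  # first subsampling, factor of two
pixel_for_masses = pixel_for_clusters * 4    # second subsampling, factor of four

tiny = [[0, 2, 4, 6],
        [2, 4, 6, 8],
        [1, 1, 3, 3],
        [1, 1, 3, 3]]
halved = subsample_by_averaging(tiny, 2)
```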

The CAD scheme developed in our laboratory (15-17) was applied without modifications (“as is”) to all
images in the database. After image segmentation and topographic multilayer region growth (15,16), the
scheme extracts a set of image features for each identified suspicious region and its surrounding tissue
background. Two artificial neural networks (ANNs), one for mass detection and one for microcalcification
cluster detection, were used to classify each suspicious region by assigning it a likelihood score for
the abnormality in question (for the likelihood of being positive) (17). With these detection scores used
as the input values of an ROC curve-fitting routine (18), performance curves were generated. After
normalization for the maximum false-positive rates, the performance results were transformed into FROC
curves. FROC curves were compared for the corresponding current and prior image data sets.
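
To make the relation between detection scores and an FROC curve concrete, the sketch below sweeps a threshold over the scores and records sensitivity against false positives per image. This is an empirical construction substituted here for illustration; it is not the ROC curve-fitting and normalization procedure described above, and the function name and sample scores are hypothetical.

```python
def froc_points(lesion_scores, false_positive_scores, n_images):
    """Return (false positives per image, sensitivity) pairs for a threshold sweep.

    lesion_scores: highest score assigned to each true abnormality
                   (0.0 if the scheme never flagged it).
    false_positive_scores: scores of flagged regions that match no abnormality.
    """
    thresholds = sorted(set(lesion_scores) | set(false_positive_scores), reverse=True)
    points = []
    for t in thresholds:
        true_positives = sum(1 for s in lesion_scores if s >= t)
        false_positives = sum(1 for s in false_positive_scores if s >= t)
        points.append((false_positives / n_images, true_positives / len(lesion_scores)))
    return points


# Hypothetical scores for four abnormalities and two false-positive regions on two images.
points = froc_points([0.9, 0.6, 0.3, 0.0], [0.8, 0.4], n_images=2)
```

Lowering the threshold moves along the curve toward higher sensitivity at the cost of more false-positive cues per image, which is why a full curve says more than a single operating point.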

False-positive cueing rates are extremely important in the screening environment (12,13). Therefore, in
our analysis, we used as operating points false-positive rates of 0.5 per image for masses and 0.2 per
image for microcalcification clusters, similar to the reported performance levels of commercially
available CAD systems (10,11) and our own experimental results (13). At these false-positive rates, we
compared the detection sensitivities for masses and clusters between the current and prior images. For
malignant mass and microcalcification cluster regions that were initially identified as suspicious by the
CAD scheme on both current and prior images but were ultimately cued only on current images, we analyzed
changes in the main features used in the ANN, to clarify why low output scores were generated for these
regions on prior images (or why these were ultimately discarded by the scheme).

Both “case-based” and “region-based” sensitivities were assessed in this study. Case-based sensitivity
includes correct cues of an abnormality (eg, a mass or cluster) on one or both views (CC, MLO, or both);
a “case” here means one abnormality and not necessarily one patient. Region-based sensitivity includes
correct cues of an abnormality depicted independently on either view (CC or MLO). The same abnormality
depicted on both views (CC and MLO) is considered two independent true-positive findings. Region-based
sensitivity was computed according to the number of correctly detected regions, rather than
abnormalities.
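
The two definitions can be contrasted with a small sketch. The data layout (one mapping of view name to cue outcome per abnormality) and the function name are assumptions made for illustration only.

```python
def sensitivities(abnormalities):
    """Return (case-based, region-based) sensitivity.

    abnormalities: one dict per abnormality, mapping each view on which it is
    visible ("CC", "MLO") to True if the scheme cued it on that view.
    """
    # Case-based: an abnormality counts once if cued on at least one view.
    cases_detected = sum(1 for views in abnormalities if any(views.values()))
    # Region-based: every visible view is an independent region.
    regions_detected = sum(sum(views.values()) for views in abnormalities)
    regions_total = sum(len(views) for views in abnormalities)
    return cases_detected / len(abnormalities), regions_detected / regions_total


# Hypothetical example: four abnormalities, one of them visible on a single view.
example = [
    {"CC": True, "MLO": False},
    {"CC": True, "MLO": True},
    {"CC": False, "MLO": False},
    {"MLO": True},
]
case_based, region_based = sensitivities(example)
```

In this example three of four abnormalities are cued on some view, while only four of seven regions are cued, so the case-based figure is the higher of the two.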


RESULTS 


Figures 1 and 2 demonstrate the case-based FROC curves for current and prior images for the detection of
masses and microcalcification clusters, respectively. Figures 3 and 4 demonstrate the region-based FROC
curves for mass and cluster detection. Figures 5 and 6 demonstrate FROC curves of case-based detection
sensitivity versus false-positive rate for malignant mass and cluster detection, respectively, after the
exclusion of biopsy-proven benign cases. The CAD scheme detected (though at a high false-positive rate)
94% of masses (162 of 172) and 95% of microcalcification clusters (122 of 128) in the current image
database.

For the prior image database, the maximum detection sensitivities were 86% for masses (148 of 172) and
73% for clusters (93 of 128), as shown in Figures 1 and 2. After benign abnormalities were excluded,
similar maximum sensitivities were obtained for mass and cluster detections: 95% for both masses (95 of
100) and clusters (75 of 79) on the current images and 76% (76 of 100) and 59% (47 of 79) for masses and
clusters, respectively, on prior images (Figs 5, 6). The scheme has comparable performance levels for
detecting malignant or benign findings on current images. Its sensitivity for malignant lesions, however,
is significantly lower than that for benign lesions on prior images (P < .01).

Figure 1. Comparison of case-based CAD performance for detection of masses on 200 current and prior
mammographic cases. The test set included 172 masses.

Figure 2. Comparison of case-based CAD performance for detection of microcalcification clusters on 200
current and prior mammographic cases. The test set included 128 true-positive clusters.

With specific thresholds set on the ANN-generated scores (0.55 for mass detection and 0.5 for cluster
detection), the false-positive rates in our database were 0.5 per image for masses and 0.2 per image for
microcalcification clusters. At these threshold levels, our CAD scheme detected 78% of malignant masses
(78 cases or 109 regions) and 80% of malignant clusters (63 cases or 92 regions) on the current images.
Suspicious regions that were cued in the corresponding areas of prior images were 53 “mass regions” (or
40 “masses”) and 51 “cluster regions” (or 36 “clusters”). The case-based sensitivities for prior images
were 40% (40 of 100) for malignant masses and 46% (36 of 79) for malignant clusters.

Figure 3. Comparison of region-based CAD performance for detection of masses on current and prior images.
The test set included 336 mass regions.

Figure 4. Comparison of region-based CAD performance for detection of microcalcification clusters on
current and prior images. The test set included 248 cluster regions.

For mass detection, 24 malignant regions were cued on the current images but not on the prior images. In
six features used in the ANN (17) for mass detection, the average feature values changed significantly
(P < .05) between current and prior images. Table 2 summarizes the changes in these features. The
estimated “size” and “contrast” of the cued regions were significantly smaller (P < .05) on prior images.
In general, because of these changes, the mass regions depicted on prior images are more difficult to
identify, not only for human observers but also for the CAD schemes optimized on a different set of cases
(19,20).

Figure 5. Comparison of case-based CAD performance for malignant mass detection. The test set included
100 malignant masses.

Figure 6. Comparison of case-based CAD performance for malignant microcalcification cluster detection.
The test set included 79 malignant clusters.

For microcalcification detection, 21 malignant cluster regions were cued on the current images but not on
the prior images due to lower ANN-generated scores. Of 13 features used in the ANN for cluster detection
(17), only two had a significant change (P < .05) in average values between current and prior images. As
may be expected, one was the number of single microcalcifications detected in a cluster, which was 25%
smaller on prior images (5.6 per cluster vs 8.2 on current images). The second was the average digital
value contrast of a single microcalcification, which was 24% less on prior images.


DISCUSSION 


There is a growing interest in using CAD to help detect breast cancers at an earlier stage. Hence, there
is a need to detect some abnormalities depicted on prior images (9,14,21). In previous studies, CAD
schemes were applied mainly to cases interpreted as recommended for recall by a panel of radiologists
during retrospective reviews. In this study, we applied a CAD scheme to prior examinations of cases that
ultimately underwent biopsy because of findings during a subsequent examination. Our experimental results
showed that 76% of malignant masses and 59% of clusters associated with malignancies were detected as
suspicious with the CAD scheme (Figs 5, 6). By applying thresholds on the ANN scores to generate
false-positive rates of 0.5 per image for mass regions and 0.2 per image for cluster regions, the scheme
ultimately detected 42% of cancers depicted on prior images. This is in the range of the fraction of
cases reported to be visible at prior examinations in other studies (9).

The detection of abnormalities was found to be more sensitive to changes in feature values on the prior
images. For example, reducing the false-positive rate for mass detection from 1.0 to 0.5 per image
decreased sensitivity by 14% (from 0.88 to 0.76) on the current images and 31% (from 0.58 to 0.40) on the
prior images (Fig 5). Our experiment also suggested that the set of features that optimally represent
malignant masses may be somewhat different on current and prior images (Table 2). This observation is in
agreement with that in another study in which a stepwise linear discriminant analysis selected different
sets of optimal features to represent masses depicted on current and prior images (22).

Unlike other studies using a commercial CAD product (8,14), for which only one operating point (detection
sensitivity at a given false-positive rate) can be analyzed, this study generated complete FROC curves.
Hence, one can compare the performance difference at any operating point and investigate the effect of
feature changes on performance. This approach may represent an important first step toward reoptimizing
CAD schemes that improve the
Table 2
Average Values of Six Features and Change in Values between Current and Prior Images for 24 Malignant Masses

Feature                                        Average for       Average for      Change (%)
                                               Current Images    Prior Images
Region size (mm²)                              133.1 ± 100.2     66.3 ± 41.4      -50.2
Contrast (digital value)                       42.1 ± 10.7       33.9 ± 12.3      -19.5
Circularity                                    0.83 ± 0.07       0.76 ± 0.09      -8.4
Standard deviation of radial length            0.21 ± 0.07       0.29 ± 0.08      +38.1
Pixel ratio of local minimum digital value     0.13 ± 0.05       0.21 ± 0.07      +61.5
Region conspicuity                             4.7 ± 1.5         3.7 ± 0.7        -21.3

Note. These 24 masses were ultimately cued on the current images but not on the prior images (P < .05 for
each of the six features). Mean values are given ± standard deviations.
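
The percentage changes in Table 2 are consistent with the change in mean value relative to the current-image mean. A short check, in which the function name and dictionary keys are labels chosen here for illustration:

```python
def percent_change(current_mean, prior_mean):
    """Change from the current-image mean to the prior-image mean, in percent."""
    return round(100.0 * (prior_mean - current_mean) / current_mean, 1)


# (current mean, prior mean) pairs as listed in Table 2.
table_2_means = {
    "region size": (133.1, 66.3),
    "contrast": (42.1, 33.9),
    "circularity": (0.83, 0.76),
    "sd of radial length": (0.21, 0.29),
    "pixel ratio of local minimum": (0.13, 0.21),
    "region conspicuity": (4.7, 3.7),
}
changes = {name: percent_change(cur, pri) for name, (cur, pri) in table_2_means.items()}
```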


detection  of  breast  cancers  at  an  earlier  stage.  Such  early 
detection  will  become  increasingly  important,  because  the 
average  stage  at  detection  will  gradually  shift  toward  that 
seen  on  prior  images  as  compliance  improves  and  women 
undergo  several  periodic  examinations. 

Finally, full-field digital mammographic systems are rapidly becoming available (23,24). Although we did
not include them in this study, we expect that the questions we considered are as relevant to full-field
digital mammograms as to digitized film images.
ABSTRACT

We  evaluated  a  telemammography  system  for  reviewing  and  rating  screening  mammography  in  a  clinical  setting.  Three 
remote  sites  transmitted  306  exams  to  a  central  site.  Films  were  digitized  at  50  micron  pixel  dimensions  and 
compressed  at  a  50:1  ratio.  At  the  central  site  images  were  displayed  on  a  workstation  with  two  high-resolution 
monitors.  Five  radiologists  reviewed  and  rated  the  screens  without  the  availability  of  prior  images  or  additional 
information  indicating:  1)  if  additional  procedures  were  needed,  2)  which  breast  was  involved,  and  3)  when  appropriate, 
the  recommended  additional  procedures.  During  the  actual  clinical  interpretation  13.7%  (42  cases)  of  the  patients  were 
recalled  for  additional  procedures.  During  the  retrospective  review  radiologists  1,  2,  3,  4,  and  5  recommended 
additional  procedures  for  26.1%,  29.1%,  36.3%,  45.1%,  and  54.2%  of  the  cases,  respectively.  The  agreements  between 
the  clinical  interpretation  and  radiologists  1,  2,  3,  4,  and  5  were  77.8%,  76.1%,  69.0%,  62.7%,  and  53.6%,  respectively. 
The  exceedingly  high  percentage  of  recommended  additional  procedures  using  the  workstation  was  attributed  to  lack  of 
prior  images  or  additional  information,  the  knowledge  that  case  management  was  not  affected,  and  the  observers’ 
expectation  for  an  enriched  case  mix. 

Keywords:  Teleradiology,  human  performance,  recall  rate,  breast  cancer  screening,  mammography. 

1.  INTRODUCTION 

Teleradiology can challenge typical radiology practices in areas ranging from personnel assignments to data
management. In remote or underserved clinics it may be necessary to evaluate personnel qualifications when
deciding whether teleradiology is appropriate and which radiographic procedures are necessary.1-3 Many teleradiology systems
employ image processing techniques to manage the digital image data in terms of data acquisition,4-9 transmission time
(e.g., compression,4,10,11,12 cropping,13 image selection14), and image display.7,8,10,11,13,14,15 The effects of data
management techniques on diagnostic image quality are application specific. Comparisons between film-based and
digitized image-based (film digitization) diagnostic radiographic interpretation have produced mixed results. In some
laboratory studies the area under the receiver operating characteristic (ROC) curve, sensitivity, and accuracy have been
shown to be slightly greater for film-based interpretation,4'7’16'17 but the differences were generally not statistically
significant. Reported specificity has been relatively equivalent for the two interpretation methods.4'7'16'17

The high spatial resolution necessary to interpret mammographic images presents unique challenges when designing
and implementing a telemammography system. Improvements in image quality of x-ray film mammography have been
associated with improvements in breast cancer detection.18-23 Therefore, it is important that the image processing
techniques of a telemammography system do not degrade the diagnostic image quality of the digital (full-field digital
mammography (FFDM)) or digitized (film digitization) mammographic images.
Mammography interpretation has been reported as relatively equivalent for film mammography and digitized
mammographic images. Fajardo et al.24 (1990) found film mammography statistically superior for detecting skin and
nipple abnormalities compared to digitized mammography in an ROC study, but found the two methods equivalent for
detecting microcalcifications and masses. An ROC study performed by Nab et al.25 (1992) found that the diagnostic
performance of film and digitized mammography were comparable. Powell et al.26 (1999) reported that film
mammography was slightly superior to digitized mammography in several diagnostic measures (i.e., accuracy,
false-positive rates, and callback rates for mammograms with normal and malignant findings), but only the callback rates for
normal findings were statistically different. The callback rates for benign findings were slightly better for digitized
mammography. A follow-up study by Powell et al.27 (2000) compared film mammography to wavelet-compressed
digitized mammographic images. The only statistically significant finding was that the false-positive rate was lower for
compressed digitized images compared to film mammography. Compressed digitized images were also slightly better
(though not significantly so) in terms of callback for mammograms with normal and benign findings. Film mammography
was slightly better (though not significantly so) for callback rates for depicting malignant abnormalities.

This  manuscript  presents  a  preliminary,  retrospective  clinical  evaluation  of  an  inexpensive,  high-quality,  multi-site 
telemammography  system28,29  for  the  review  of  screening  mammography  examinations.  The  study  was  designed  to 
assess  the  effectiveness  of  the  system  for  the  review  of  breast  cancer  screening  mammography  with  the  objective  to 
assess  its  possible  use  in  determining  the  need  for  additional  procedures  (rather  than  primary  diagnosis).  The  limited 
retrospective  review  was  conducted  using  only  digitized  mammographic  images  without  the  benefit  of  prior  images  or 
any  additional  information.  Five  radiologists  reviewed  and  rated  screening  exams  using  the  telemammography  system, 
and  their  results  were  compared  to  the  actual  clinical  interpretations  of  the  same  cases  regarding  the  need  for  additional 
procedures.  It  was  anticipated  that  in  this  experimental  protocol  the  number  of  cases  recommended  for  additional 
procedures  would  be  greater  during  the  limited  telemammography  review  compared  to  the  clinical  interpretation. 

2.  METHODS 


2.1  Case  selection 

The  306  cases  retrospectively  evaluated  in  this  study  originated  from  patients  who  underwent  breast  cancer  screening 
mammography at three women’s imaging centers. The mammography technologists at these centers were instructed to
select approximately equal numbers of cases that they (the technologists) believed would and would not need additional
imaging procedures for a complete evaluation. Cases were selected by the technologists prospectively, and they
did not know at the time of selection whether or not the patient would actually be recalled for additional procedures
during the clinical interpretation. The mean patient age was 53.8 years (range, 35 to 88 years). The actual,
subsequent  clinical  interpretation  categorized  each  case  using  the  Breast  Imaging  Reporting  and  Data  System 
(BIRADS)  (Table  1).  The  four  routine  screening  mammographic  films  of  the  left  and  right  craniocaudal  views  (LCC  & 
RCC),  and  left  and  right  mediolateral  oblique  views  (LMLO  &  RMLO)  were  used  to  review  and  rate  cases  in  this  study. 

Table 1

Distribution of BIRADS categories as a result of clinical
interpretation of the cases

BIRADS category      0      1      2      Total
Number of cases      42     206    58     306

2.2  Telemammography  system 

The  cases  for  this  study  were  transmitted  from  the  three  centers  (remote  sites)  to  Magee-Womens  Hospital,  Pittsburgh, 
PA, USA (central site) using an inexpensive, high-quality, multi-site telemammography system. System operation,
including digitization of the mammographic films, digital image processing, data transmission, and image display,
was conducted under routine operating procedures and is described in detail by Drescher et al.29 (2003). A brief
description, as relevant to this study, is provided below.

2.2.1 Central and remote sites

The central site telemammography workstation is connected to two high-resolution (2048 x 2560) 8-bit grayscale
portrait monitors at a nominal setting of 80 ftL (DS5100P, Clinton Electronics, Rockford, IL, USA). A dual 1.2 GHz
multi-processor (Athlon MP, Advanced Micro Devices, Sunnyvale, CA, USA) with 2 GB of RAM powers the
workstation, which operates under Microsoft Windows 2000 Server (Microsoft Corporation, Redmond, WA, USA).
The workstation is equipped with 56K hardware modems (U.S. Robotics, Rolling Meadows, IL, USA) and an Ethernet
network card (OfficeConnect 10/100 NIC, 3COM, Santa Clara, CA, USA) for communication with the remote sites.

The computers at the remote sites operate under Microsoft Windows 2000 Workstation and are powered by a 900 MHz
processor (Athlon 900, Advanced Micro Devices, Sunnyvale, CA, USA) with 512 MB of RAM. The mammographic
films are digitized using high-resolution laser film digitizers (Lumiscan 85, Eastman Kodak, Rochester, NY, USA) at
50 micron pixel dimensions and 12-bit grayscale. Data communication from the remote site computers is conducted via
56K hardware modems and Ethernet network cards (Integrated PRO/100 S Desktop Adapter, Intel Corporation, Santa
Clara,  CA,  USA).  Sites  1  and  2  are  15  and  20  miles  from  the  central  site,  respectively,  and  transmit  data  across  Plain 
Old  Telephone  System  (POTS)  lines.  Site  3  is  90  miles  from  the  central  site  and  transmits  data  across  a  Local  Area 
Network  (LAN). 

2.2.2  Image  processing 

The  first  image  processing  step  was  to  perform  an  automated  cropping  that  removed  the  non-tissue  area  surrounding  the 
breast. Next, the image data were compressed using the irreversible (lossy), 9/7 transform, wavelet-based JPEG 2000
method at a 50:1 compression ratio. Prior to transmission from the remote sites, the data packets were encrypted using
strong 128-bit Microsoft Point-to-Point Encryption (MPPE) with Microsoft Challenge Handshake Authentication Protocol
(CHAP) version 2.
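
As a rough illustration of why the 50:1 compression matters on POTS lines, the following sketch estimates the data volume and idealized modem transfer time for a single digitized view. The 18 cm x 24 cm film size, the storage of 12-bit pixels in 2 bytes, and a sustained 56 kbit/s throughput are assumptions made for illustration only; none of them is stated in the text, and the automated cropping would reduce these figures further.

```python
# Rough estimate of per-image data volume and modem transfer time.
# ASSUMPTIONS (not from the paper): 18 cm x 24 cm film, 12-bit pixels stored
# in 2 bytes, and a sustained modem throughput of 56 kbit/s.

FILM_WIDTH_MM = 180.0      # assumed film width
FILM_HEIGHT_MM = 240.0     # assumed film height
PIXEL_SIZE_MM = 0.05       # 50 micron digitization (from the text)
BYTES_PER_PIXEL = 2        # assumed storage for 12-bit data
COMPRESSION_RATIO = 50.0   # 50:1 JPEG 2000 (from the text)
MODEM_BITS_PER_S = 56_000  # assumed sustained rate of a 56K modem


def image_bytes(width_mm, height_mm, pixel_mm, bytes_per_pixel):
    """Uncompressed size in bytes of one digitized film."""
    cols = round(width_mm / pixel_mm)
    rows = round(height_mm / pixel_mm)
    return cols * rows * bytes_per_pixel


def transfer_seconds(n_bytes, bits_per_second):
    """Idealized transfer time, ignoring protocol and encryption overhead."""
    return n_bytes * 8 / bits_per_second


raw = image_bytes(FILM_WIDTH_MM, FILM_HEIGHT_MM, PIXEL_SIZE_MM, BYTES_PER_PIXEL)
compressed = raw / COMPRESSION_RATIO

raw_minutes = transfer_seconds(raw, MODEM_BITS_PER_S) / 60
compressed_seconds = transfer_seconds(compressed, MODEM_BITS_PER_S)

print(f"uncompressed: {raw / 1e6:.1f} MB, about {raw_minutes:.0f} min")
print(f"compressed:   {compressed / 1e6:.2f} MB, about {compressed_seconds:.0f} s")
```

Under these assumptions a four-view examination would need roughly 6 to 7 minutes of transfer time after compression, compared with more than 5 hours without it.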

Upon arrival at the central site, the image data were decrypted and decompressed. The decompressed image data were
minimally unsharp masked to enhance display on the workstation monitors. The image data range was maximized for
display by re-scaling the image data from 0 to 4095. To facilitate image viewing, default look-up table (LUT) values
were automatically calculated based on the typically bimodal pixel value distribution (histogram). The images were
restored to full height, but not the full width, by padding (filling) prior to image display.
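
The re-scaling step can be sketched as a linear stretch of the decompressed pixel values onto the full 12-bit range. The min-max formula and the function name `rescale_to_12bit` are assumptions for illustration; the text states only that the data were re-scaled from 0 to 4095.

```python
# Linear rescaling of pixel values to the full 12-bit range (0..4095).
# ASSUMPTION: a simple min-max stretch; the paper does not give the formula.

def rescale_to_12bit(image):
    """Map the smallest pixel value to 0 and the largest to 4095.

    `image` is a list of rows, each a list of numeric pixel values.
    """
    flat = [p for row in image for p in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A constant image has no range to stretch; return all zeros.
        return [[0 for _ in row] for row in image]
    return [[round((p - lo) * 4095 / (hi - lo)) for p in row] for row in image]


demo = [[100, 200], [400, 500]]   # toy values, not image data from the study
stretched = rescale_to_12bit(demo)
print(stretched)
```

The darkest toy pixel maps to 0 and the brightest to 4095, so the display LUT can then operate on the full available range.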


Fig. 1. Telemammography workstation at the central site pictured in the default image
display format.

2.2.3 Central site image display

Several mouse-driven image display features were available to the user on the central site workstation during case
review. The possible image display formats were one image/monitor, two images/monitor, or four images/monitor.
To duplicate our standard film presentation, LCC and RCC were displayed on the left monitor, and LMLO and RMLO
on the right monitor, as the default presentation (Fig. 1).


The typical display resolution was approximately 100 micron pixel dimensions for one image/monitor and 200 micron
pixel dimensions for two images/monitor. Images could be magnified using a free-moving magnification box or quadrant
panning. The magnification box size varied depending on the image display format; for one image/monitor the box
was 511 x 566 pixels and for two images/monitor the box was 204 x 266 pixels. The LUT settings could be adjusted
by the user by moving the mouse horizontally or vertically. Selected LUT settings could be applied (at the user’s option)
to all images associated with the case and could be reset to the default (automated) values at any point.


2.3  Reviewing  and  rating  cases 

Five  experienced  radiologists  (each  reading  over  2000  mammograms  per  year)  reviewed  and  rated  each  case  on  the 
telemammography  workstation.  Cases  were  randomly  presented  in  each  session.  The  rating  form  for  each  case  was 
presented  on  the  workstation  monitors  and  completed  using  the  computer  mouse  (Fig.  2).  The  computerized  scoring 
form  recorded:  (1)  if  additional  procedures  were  indicated,  (2)  use  of  prior  images  (disabled  for  this  study),  (3)  which 
breast  was  involved,  and  (4)  when  appropriate,  the  specific  recommended  procedure.  The  radiologists’  reviews  were 
conducted  based  entirely  on  the  four  mammographic  views  (LCC,  RCC,  LMLO,  &  RMLO),  without  additional, 
potentially  relevant  information  (e.g.,  prior  images,  prior  reports,  patient  history).  The  radiologists  were  informed  of  the 
case  origination,  but  not  the  case  selection  criteria.  The  written  instructions  to  observers  regarding  case  review  were: 

In  this  phase  of  testing  our  telemammography  system,  we  would  like  you  to  review  cases  and  take  a  few 
seconds  to  quickly  decide  whether  or  not  the  case  should  be  recalled  for  additional  procedures.  These 
cases  are  routine  screening  mammograms.  You  will  fill  out  a  computer  form  to  indicate  if  a  case  should 
be  recalled.  If  you  choose  to  recall  the  case  you  must  check  off  which  additional  procedures  you  would 
recommend  for  each  breast.  A  “done”  button  on  the  bottom  of  the  form  will  bring  up  the  next  case.  The 
computer  will  automatically  track  the  cases  that  you  have  completed  and  load  your  remaining  cases;  the 
count  will  be  in  the  bottom  of  the  right  screen. 


[Figure: ADDITIONAL IMAGING FORM. Header fields: patient, MRN, exam date, radiologist, send date. Items: recall this
patient for additional evaluation (yes/no); were prior images or information used in the decision (yes/no); recommended
additional image follow-up on which breast (right/left/both); recommended additional images (check all that apply), with
separate right and left check boxes for: magnification without compression spot; tangential for calcifications;
compression spot without magnification; roll views for localization; compression spot with magnification; 30 degree
views; exaggerated cranial-caudal view; ultrasound. Buttons: Done, Cancel.]

Fig. 2. Computer scoring form completed by the radiologists for each case.

2.4 Data analysis

The  radiologists’  recommendations  using  the  telemammography  workstation  were  compared  with  the  actual  clinical 
interpretation  during  the  original  clinical  review.  The  comparisons  were  done  using  agreement/disagreement  measures. 
The  disagreements  when  clinical  interpretation  indicated  no-recall  and  telemammography  interpretation  indicated  recall 
were  further  evaluated  based  on  the  actual  BIRADS  ratings  during  the  clinical  interpretation. 

3.  RESULTS 

Image  quality,  effects  of  the  image  processing,  and  features  of  the  multi-site  telemammography  system  were 
subjectively  reported  as  more  than  adequate  for  reviewing  screening  mammography  examinations  and  generally  were 
well received by the radiologists. The cropped images retained all breast tissue areas and were visually appealing for
image review. The automated LUT settings were normally acceptable and were changed in approximately 10% of the
cases during review. Magnification allowed detailed review of the breast tissue patterns, particularly
microcalcifications. Although there were some detectable differences at extremely high magnifications between
non-compressed and compressed images at a 50:1 compression ratio, the differences were subjectively judged to “not affect the
diagnostic quality.”

The preliminary assessment of the limited case review (i.e., no prior images, prior reports, or patient history) of
screening exams using the multi-site telemammography system resulted in exceedingly high recommended recall
rates and modest agreement between the actual clinical interpretation and the radiologists’ recommendations using the
telemammography system. During the actual clinical interpretation, 13.7% (42) of the cases were recalled (BIRADS =
0). The recall rates of radiologists 1, 2, 3, 4, and 5 were 26.1% (80), 29.1% (89), 36.3% (111), 45.1% (138), and 54.2% (166),
respectively, when using the telemammography system to determine the need for additional procedures (Table 2). The
overall agreement between the clinical interpretation and the recommendations of radiologists 1, 2, 3, 4, and 5 was
77.8%, 76.1%, 69.0%, 62.7%, and 53.6%, respectively. Kappa values for radiologists 1, 2, 3, 4, and 5 were 0.32, 0.32, 0.22,
0.20, and 0.13, respectively.

Table 2

Reviewing and rating screening mammography exams: telemammography workstation
recommendations versus clinical interpretation

Telemammography                        Clinical interpretation
recommendation                recall (n = 42)   no-recall (n = 264)   Total

Radiologist 1   recall        8.8% (27)         17.3% (53)            26.1% (80)
                no-recall     4.9% (15)         69.0% (211)           73.9% (226)

Radiologist 2   recall        9.5% (29)         19.6% (60)            29.1% (89)
                no-recall     4.2% (13)         66.7% (204)           70.9% (217)

Radiologist 3   recall        9.5% (29)         26.8% (82)            36.3% (111)
                no-recall     4.2% (13)         59.5% (182)           63.7% (195)

Radiologist 4   recall        10.8% (33)        34.3% (105)           45.1% (138)
                no-recall     2.9% (9)          52.0% (159)           54.9% (168)

Radiologist 5   recall        10.8% (33)        43.5% (133)           54.2% (166)
                no-recall     2.9% (9)          42.8% (131)           45.8% (140)
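
The agreement measures can be recomputed from the counts in Table 2. The sketch below assumes that “Kappa” refers to Cohen's kappa for a 2 x 2 table, which the text does not state explicitly; the function and argument names are illustrative. With this assumption the computed values agree with the reported ones to within about 0.01.

```python
# Agreement statistics for one reader's 2 x 2 table (counts from Table 2).
# ASSUMPTION: "Kappa" is taken to be Cohen's kappa; the paper does not say.

def agreement_stats(both_recall, tele_only, clinical_only, both_no_recall):
    """Return (overall agreement, Cohen's kappa, telemammography-only share).

    both_recall     : recall on telemammography and clinically
    tele_only       : recall on telemammography, no-recall clinically
    clinical_only   : no-recall on telemammography, recall clinically
    both_no_recall  : no-recall on both
    """
    n = both_recall + tele_only + clinical_only + both_no_recall
    observed = (both_recall + both_no_recall) / n

    # Chance agreement from the marginal recall / no-recall proportions.
    tele_recall = (both_recall + tele_only) / n
    clin_recall = (both_recall + clinical_only) / n
    expected = tele_recall * clin_recall + (1 - tele_recall) * (1 - clin_recall)

    kappa = (observed - expected) / (1 - expected)
    # Fraction of all disagreements that are telemammography-only recalls.
    tele_only_share = tele_only / (tele_only + clinical_only)
    return observed, kappa, tele_only_share


# Radiologist 1 in Table 2: 27, 53, 15, and 211 cases.
agreement, kappa, share = agreement_stats(27, 53, 15, 211)
print(f"agreement={agreement:.1%}  kappa={kappa:.2f}  tele-only share={share:.1%}")
```

For radiologist 1 this gives an overall agreement of 77.8%, a kappa of 0.32, and a telemammography-only share of the disagreement of 77.9%.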

Cases in which the recommendation using the telemammography system was “recall” and the clinical interpretation
indicated “no-recall” represented a large percentage of the disagreement, and nearly one half had some type of finding
reported during the clinical review. Disagreement of this type accounted for 77.9%, 82.2%, 86.3%, 92.1%, and 93.7% of the total disagreement
for radiologists 1, 2, 3, 4, and 5, respectively (Table 2). Further evaluation of these disagreement cases revealed that
cases with a BIRADS category of 2 during the clinical interpretation accounted for 49.1%, 53.3%, 51.2%, 34.3%, and
36.1% of the disagreement cases for radiologists 1, 2, 3, 4, and 5, respectively (Table 3).

Proc. of SPIE Vol. 5033

Table 3

Disagreement cases when the clinical interpretation was no-recall and the
telemammography recommendation was recall for different BIRADS ratings
during the clinical interpretation

                                     BIRADS category
Disagreement cases            1 (n = 206)      2 (n = 58)

Radiologist 1 (n = 53)        50.9% (27)       49.1% (26)
Radiologist 2 (n = 60)        46.7% (28)       53.3% (32)
Radiologist 3 (n = 82)        48.8% (40)       51.2% (42)
Radiologist 4 (n = 105)       65.7% (69)       34.3% (36)
Radiologist 5 (n = 133)       63.9% (85)       36.1% (48)
Average (n = 86.6)            55.2% (49.8)     44.8% (36.8)

4.  DISCUSSION 


The  review  of  breast  cancer  screening  mammography  by  five  experienced  radiologists  using  the  telemammography 
system  demonstrated  that  the  system  was  adequate  for  reviewing  the  mammographic  image  data.  The  limited, 
retrospective review of screening exams using the telemammography system with only mammographic image data (i.e., no prior
images, prior reports, or patient history) produced modest agreement with the actual clinical interpretation. The
agreement between the limited telemammography review and clinical interpretation for the five radiologists ranged from
53.6% to 77.8%, and Kappa ranged from 0.13 to 0.32. On average, the radiologists recommended additional procedures
using the limited telemammography system in 38.2% of cases, which was exceedingly high compared with the 13.7% of
patients actually recalled in this group during the clinical interpretation.

The  majority  of  the  disagreement  between  the  two  review  formats  occurred  when  the  telemammography  review  resulted 
in  a  recommendation  for  additional  procedures  and  the  clinical  interpretation  did  not,  accounting  for  an  average  of 
86.4%  of  the  disagreement  cases  for  the  five  radiologists.  Of  these  disagreement  cases  (clinical  no-recall  and 
telemammography  recall),  on  average  across  the  radiologists  44.8%  of  the  patients  had  a  clinical  BIRADS  category  of 
2.  That  is,  when  findings  were  detected  using  the  telemammography  system  under  restricted  conditions,  but  the  history 
of  the  findings  (i.e.,  new,  increased,  or  unchanged)  was  unavailable,  the  radiologists  tended  to  recommend  additional 
procedures. Another potential partial explanation for the high recall rate was the radiologists’ expectation of an
“enriched” sample population because of their knowledge that this was a laboratory study. In addition, the mere fact that
patient recall in such a study does not affect clinical management tends to produce overreading.

High  recall  rates  were  similarly  observed  by  Elmore  et  al.30  (1994),  where  11-65%  of  patients  without  cancer  were 
recommended  for  immediate  workup.  In  the  Elmore  study,  prior  images  were  not  available  for  any  of  the  cases 
reviewed  and  clinical  history  was  not  available  for  every  case.  They  also  attributed  the  high  recall  rates  to  the 
radiologists’  knowledge  of  an  “enriched”  sample  population  and  study  participation. 

Although  the  limited,  retrospective  review  using  the  telemammography  system  produced  modest  agreement  with  the 
actual clinical interpretation, the feasibility of using the system for such a review was clearly demonstrated, and the system was
well received by the radiologists. Efforts are under way to add information such as text communication between the
technologist (remote site) and radiologist (central site) to the information transmitted with each case.

ABSTRACT

We  investigated  a  new  approach  to  improve  the  performance  of  a  computer-aided  detection  (CAD)  scheme  in 
identifying  masses  depicted  on  images  acquired  earlier  (“prior”).  The  scheme  was  trained  using  a  dataset  with  simulated 
mass features. From a database with images acquired during two consecutive examinations, 100 location-matched pairs
of malignant mass regions were selected in both the “current” and the most recent “prior” images. While reviewing the
current images, radiologists identified the mass regions, and as a result biopsies were ultimately performed. The corresponding
regions on the prior images had not been identified as suspicious by radiologists during the original interpretation. The same number of false-positive regions was
also selected in both current and prior images. The selected regions were then randomly divided into training and testing
datasets with 50 true-positive and 50 false-positive regions in each. For each selected region, five features (area, contrast,
circularity, normalized standard deviation of radial length, and conspicuity) were computed. The ratios of the average
values of the five features between prior and current mass regions in the training datasets were also computed.
Multiplying these ratios by the computed values in current mass regions, we generated a new dataset of simulated
features of “prior” mass regions. Three artificial neural networks (ANN) were trained. ANN-1 and ANN-2 were trained
using training datasets of current and prior regions, respectively. ANN-3 was trained using the simulated “prior” dataset.
The performance of the three ANNs was then evaluated using the testing dataset of prior images. Areas under the ROC curves
($A_z$) were 0.613 ± 0.026 for ANN-1, 0.678 ± 0.029 for ANN-2, and 0.667 ± 0.029 for ANN-3. This
preliminary study demonstrated that one could estimate an average change of feature values over time and “adjust” CAD
performance for better detection of masses at an earlier stage.

Keywords:  Computer-aided  detection,  Mammography,  Mass  detection,  Artificial  neural  network 


1.  INTRODUCTION 

Computer-Aided  Detection  (CAD)  systems  are  currently  used  in  a  large  number  of  medical  institutions  around  the  world 
to  assist  radiologists  in  reading  and  interpreting  mammograms  in  the  screening  environment  [1-3].  A  large  number  of 
studies  have  been  conducted  to  assess  the  possible  impact  of  CAD  systems  on  radiologists’  performance.  Although  there 
is  no  general  agreement  on  whether  and  how  CAD  systems  help  radiologists  improve  their  diagnostic  accuracy  [3-6], 
several  studies  demonstrated  that  the  performance  of  the  CAD  scheme  itself  might  be  an  important  factor  to  increase 
radiologists’  confidence  to  accept  and  act  on  the  CAD  cues  and  help  to  improve  their  diagnostic  accuracy  when  using 
such  tools  [6-8]. 

Current guidelines recommend periodic mammography screening for women over the age of 40 [9]. As compliance
increases in the general population, a large fraction of patients will have undergone a series of consecutive mammographic
examinations. As a result, detected breast cancers will, in time, “shift” on average toward an earlier stage. In fact,
retrospective reviews have indicated that a large fraction of breast cancers identified by radiologists were also
visible in prior images [10]. It is expected that comparison with prior images could, over time, help radiologists detect
more subtle cancers [11,12]; hence, more subtle cancers will be considered “visible” or detectable on routine
mammograms. In such a changing environment, maintaining “optimal” performance of CAD schemes becomes a
challenge. Although CAD schemes can detect a large number of true-positive abnormalities (e.g., masses and
microcalcification clusters) depicted on prior images [7,12,13], current CAD schemes that have been optimized using a
large fraction of “easy” cancers are unlikely to achieve “optimal” performance in detecting “earlier” or more “subtle”
cancers. This is due to two factors: (1) the performance of CAD schemes that use a feature-based machine-learning
classifier depends heavily on the characteristics of the training database [14,15], and (2) a large number of the image features
used to train CAD schemes differ for abnormalities as depicted on current images compared with prior
images [16]. Several studies have demonstrated that in order to achieve optimal performance in detecting suspicious
masses as depicted on prior images, a different set of image features should be selected for re-optimization of CAD
schemes [17,18].

Medical Imaging 2003: Image Processing, Milan Sonka, J. Michael Fitzpatrick, Editors,
Proceedings of SPIE Vol. 5032 (2003) © 2003 SPIE · 1605-7422/03/$15.00

In previous studies [17,18], optimal performance in detecting masses depicted on prior images was achieved by
retraining the scheme using a set of mass regions extracted from prior images. This requires a significant effort. Since
there  is  a  training  database  available  for  each  CAD  scheme,  this  database  could  potentially  be  used  to  re-optimize  the 
scheme  after  a  computational  adjustment  of  some  feature  values.  For  this  purpose,  we  investigated  a  new  method  to 
generate  a  simulated  training  database  and  used  it  to  re-optimize  our  CAD  scheme.  A  detailed  description  of  our 
approach  and  preliminary  experimental  results  follow. 


2.  MATERIALS  AND  METHODS 

From  an  image  database  established  in  our  laboratory,  we  selected  100  matched  pairs  of  digitized  mammograms  from 
two consecutive (the most recent or “current” and the latest previous or “prior”) examinations. There is a verified mass
region  depicted  in  each  case.  During  the  current  examination,  these  100  mass  regions  were  identified  by  radiologists  as 
suspicious  and  as  a  result  biopsies  were  ultimately  performed.  Although  in  a  retrospective  review  and  with  the  support  of 
available  source  documents,  an  experienced  observer  could  identify  some  indication  of  the  presence  of  a  “mass”  in  the 
corresponding  locations  on  prior  images,  these  regions  had  not  been  identified  as  suspicious  by  radiologists  during  the 
original  interpretation.  All  100  mass  regions  selected  for  this  study  were  associated  with  biopsy-proven  malignancies. 
The  locations  of  all  masses  depicted  on  current  images  and  the  corresponding  locations  on  prior  images  were  visually 
identified.  The  centers  (x,  y  coordinate)  of  all  verified  mass  regions  were  marked  manually  and  saved  in  a  reference  (or 
“truth”)  file. 

All  200  images  (100  from  current  and  100  from  prior  examination)  were  processed  by  a  CAD  scheme  developed 
previously  in  our  laboratory  [19].  To  detect  suspicious  masses,  each  image  is  first  subsampled  (pixel-averaged)  in  both 
dimensions to increase pixel size from the original 50 µm x 50 µm (or in some cases 100 µm x 100 µm) to 400 µm x 400
µm. The CAD scheme then uses three stages to identify suspicious regions. In the first stage, the scheme filters the image
with two Gaussian filters that differ greatly in kernel size (7 and 51 pixels), subtracts the results, and thresholds the difference
to search for the initial set of “suspicious” regions, which usually yields 10 to 30 initial
“suspicious” regions per image. In the second stage, based on local contrast measurement the scheme uses an adaptive
region  growth  algorithm  to  define  three  topographic  layers.  After  simple  intra-layer  based  threshold  conditions  on 
growth  ratio  and  shape  factor,  this  stage  typically  eliminates  approximately  85%  of  regions  identified  in  stage  one,  while 
maintaining  a  very  high  sensitivity.  A  set  of  features  is  computed  for  each  detected  region.  During  stage  three  the 
detected  regions  are  classified  based  on  scores  (likelihood  of  being  true-positive)  generated  by  a  nonlinear  multi-layer 
feature-based classifier (e.g., an artificial neural network) [20]. To determine whether a detected region represents a
true-positive or false-positive mass region in this study, the following criterion was used. If the distance between the center of
gravity of a detected region and the center of the mass as recorded in the reference file was shorter than the radius of the
longest axis of the detected region, it was considered a true-positive identification. Otherwise, the region was
considered  a  false-positive  identification.  In  this  experiment,  all  suspicious  mass  regions  identified  after  the  second  stage 
of  the  CAD  scheme  became  candidates  for  the  study  (namely,  the  classification  scores  in  the  third  stage  were  ignored). 
One hundred true-positive mass regions from current images and 100 from prior images were selected. The
CAD scheme also detected 187 and 202 false-positive mass regions in the current and prior images, respectively. From these, 200
false-positive regions were randomly selected (100 from current images and 100 from prior images). Hence, 400
suspicious mass regions were selected for the study. The regions were then divided (block randomization) into training
and testing datasets for both current and prior images. Each dataset included 50 true-positive and 50 false-positive mass
regions.
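
The true-positive criterion described above can be expressed as a short distance test. This is a minimal sketch that reads “radius of the longest axis” as half the length of the longest axis of the detected region; that reading, the function name, and the toy coordinates are assumptions for illustration.

```python
import math

# True-positive test for a detected region, following the criterion in the text.
# ASSUMPTION: "radius of the longest axis" is read as half the longest axis.

def is_true_positive(region_center, longest_axis, truth_center):
    """Return True if the detected region counts as a hit on the marked mass.

    region_center : (x, y) center of gravity of the detected region, in pixels
    longest_axis  : length of the region's longest axis, in pixels
    truth_center  : (x, y) mass center recorded in the reference file
    """
    distance = math.dist(region_center, truth_center)
    return distance < longest_axis / 2.0


# Toy coordinates, not study data.
hit = is_true_positive((120, 80), 30, (128, 86))    # distance 10, radius 15
miss = is_true_positive((120, 80), 30, (140, 80))   # distance 20, radius 15
print(hit, miss)
```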

For  each  region  the  following  five  features  were  computed: 


1. Region area ($F_1 = 0.16 \times N_T$): This feature is computed by counting the number of pixels in the growth
region ($N_T$) and then multiplying it by the size unit of each pixel (0.16 mm$^2$).

2. Average contrast ($F_2 = \frac{1}{N_T}\sum_{i=1}^{N_T} I_i - \frac{1}{N_B}\sum_{j=1}^{N_B} I_j$): This feature is computed as the average pixel value ($I$)
difference between the growth region and its surrounding background.

3. Circularity ($F_3 = \frac{N_C}{N_T}$): To compute this feature, the CAD scheme first computes the area of a growth region
($N_T$) and calculates an equivalent circle centered at the center of gravity of the region. For a circle with the
same area as the growth region, the number of pixels that are located inside both the growth region contour and the
circle ($N_C$) is computed. Circularity is defined as the fraction of the growth region pixels covered by the
circle.

4. Normalized standard deviation of radial length ($F_4 = \sqrt{\frac{1}{N_b}\sum_{i=1}^{N_b}\left(\frac{r_i - m_r}{m_r}\right)^2}$): The radial length $r_i$ is defined
as the distance between the region center and a point ($i$) located on the perimeter of the region. $m_r$ is the mean
value of radial length over all $N_b$ points on the region boundary. This feature indicates the changes in the shape
of the region boundary.

5. Conspicuity ($F_5 = \frac{F_2}{C_B}$): This feature is defined as “region contrast” ($F_2$) divided by “surrounding
complexity” ($C_B$), where $C_B = \frac{1}{N_B}\sum_{i=1}^{N_B} \mathrm{Max}(I_i - I_j)$ and $\mathrm{Max}(I_i - I_j)$ is the maximum pixel value
difference between background pixel ($i$) and its neighboring pixels $j$ (e.g., 24 pixels in a 5 x 5 square window).
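
The five feature definitions can be sketched as small functions operating on pixel values and boundary coordinates that are assumed to have been extracted already. The function names and inputs are illustrative and are not taken from the CAD scheme itself; region growing, the equal-area circle overlap, and the computation of the surrounding complexity are left outside the sketch.

```python
import math

PIXEL_AREA_MM2 = 0.16  # 400 micron x 400 micron pixels (from the text)


def region_area(n_region_pixels):
    """F1: region area in mm^2."""
    return PIXEL_AREA_MM2 * n_region_pixels


def average_contrast(region_values, background_values):
    """F2: mean region pixel value minus mean background pixel value."""
    mean_region = sum(region_values) / len(region_values)
    mean_background = sum(background_values) / len(background_values)
    return mean_region - mean_background


def circularity(n_overlap_pixels, n_region_pixels):
    """F3: fraction of region pixels covered by the equal-area circle."""
    return n_overlap_pixels / n_region_pixels


def radial_length_nsd(center, boundary_points):
    """F4: standard deviation of radial length divided by its mean."""
    radii = [math.dist(center, p) for p in boundary_points]
    mean_r = sum(radii) / len(radii)
    variance = sum((r - mean_r) ** 2 for r in radii) / len(radii)
    return math.sqrt(variance) / mean_r


def conspicuity(contrast, surrounding_complexity):
    """F5: region contrast divided by surrounding complexity C_B."""
    return contrast / surrounding_complexity


# Toy inputs, not study data: four boundary points equidistant from the center.
print(radial_length_nsd((0, 0), [(5, 0), (0, 5), (-5, 0), (0, -5)]))
```

A boundary whose points are all equidistant from the center gives a value of zero for the fourth feature, and more irregular boundaries give larger values.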


Using  these  features,  three  artificial  neural  networks  (ANN)  were  constructed  to  classify  suspicious  regions.  The 
topology of all ANNs was the same. It involved five input neurons (each representing one feature), three hidden
neurons, and one output neuron. Each ANN was trained using 500 iterations. The training momentum and learning rate
were 0.8 and 0.01, respectively.
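
A network with this topology can be sketched in a few dozen lines. The text specifies only the 5-3-1 layout, the number of iterations, the momentum, and the learning rate; the sigmoid activations, squared-error loss, per-sample weight updates, bias terms, and random initialization used below are assumptions, and the toy feature vectors are not study data.

```python
import math
import random

# Minimal 5-3-1 feed-forward network trained by backpropagation with momentum.
# ASSUMPTIONS: sigmoid activations, squared-error loss, per-sample updates,
# bias terms, and small random initial weights.

N_IN, N_HID = 5, 3
MOMENTUM, LEARNING_RATE, ITERATIONS = 0.8, 0.01, 500


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_network(rng):
    """Each neuron's weight list ends with a bias weight."""
    hidden = [[rng.uniform(-0.5, 0.5) for _ in range(N_IN + 1)] for _ in range(N_HID)]
    output = [rng.uniform(-0.5, 0.5) for _ in range(N_HID + 1)]
    return hidden, output


def forward(net, features):
    """Return the biased input, biased hidden activations, and output score."""
    hidden_w, output_w = net
    x = list(features) + [1.0]
    hb = [sigmoid(sum(w * v for w, v in zip(ws, x))) for ws in hidden_w] + [1.0]
    score = sigmoid(sum(w * v for w, v in zip(output_w, hb)))
    return x, hb, score


def train(net, samples, rng):
    """samples: list of (five feature values, label 0 or 1)."""
    hidden_w, output_w = net
    prev_hidden = [[0.0] * (N_IN + 1) for _ in range(N_HID)]
    prev_output = [0.0] * (N_HID + 1)
    order = list(samples)
    for _ in range(ITERATIONS):
        rng.shuffle(order)
        for features, label in order:
            x, hb, score = forward(net, features)
            delta_out = (label - score) * score * (1.0 - score)
            # Hidden deltas use the output weights before they are updated.
            delta_hid = [delta_out * output_w[j] * hb[j] * (1.0 - hb[j])
                         for j in range(N_HID)]
            for j in range(N_HID + 1):
                step = LEARNING_RATE * delta_out * hb[j] + MOMENTUM * prev_output[j]
                output_w[j] += step
                prev_output[j] = step
            for j in range(N_HID):
                for i in range(N_IN + 1):
                    step = (LEARNING_RATE * delta_hid[j] * x[i]
                            + MOMENTUM * prev_hidden[j][i])
                    hidden_w[j][i] += step
                    prev_hidden[j][i] = step
    return net


rng = random.Random(0)
net = init_network(rng)
toy = [([0.9, 0.8, 0.7, 0.2, 0.9], 1), ([0.1, 0.2, 0.3, 0.8, 0.1], 0)]
train(net, toy, rng)
score = forward(net, toy[0][0])[2]
print(score)  # classification score between 0 and 1
```

In practice the five features would presumably be scaled to comparable ranges before training, a step the text does not describe.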


ANN-1 and ANN-2 were trained using the training datasets of current and prior images, respectively. ANN-3 was
trained using a set of simulated “prior” mass regions. To generate the simulated dataset, we computed, for each of the
five features, the ratio of the average feature values between the 50 pairs of true-positive mass regions extracted from prior
and current images. Ratios were computed as follows:


$$D_k = \frac{\frac{1}{N} \sum_{j=1}^{N} F_{k,j}^{Prior}}{\frac{1}{N} \sum_{j=1}^{N} F_{k,j}^{Current}}, \quad k = 1, 2, 3, 4, 5, \quad N = 50.$$

Each feature of a true-positive mass region in the current training dataset was then multiplied by the corresponding
ratio, such that $F'_{k,j} = F_{k,j}^{Current} \times D_k$. Hence, a set of new feature values was generated to represent
each of 50 “simulated true-positive mass regions.” Using these data combined with feature values of 50 original
false-positive regions extracted from the current images, ANN-3 was trained. Although the 50 simulated mass regions
(used in ANN-3) and the 50 original prior mass regions (used in ANN-2) have identical mean values for each of the five
features, the feature values for a specific region are different (i.e., $F'_{k,j} \neq F_{k,j}^{Prior}$,
$k = 1, 2, ..., 5$). In other words, the simulated set of “prior” features
does not simply duplicate the actual feature set in prior images.
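The ratio-based simulation described above can be sketched in a few lines. The sketch assumes the paired feature values are stored as two lists of equal length, one feature vector per mass region; the function name `simulate_prior_features` is hypothetical.

```python
def simulate_prior_features(current, prior):
    """current, prior: lists of N feature vectors from paired mass regions."""
    n = len(current)
    n_features = len(current[0])
    ratios = []
    for k in range(n_features):
        mean_prior = sum(row[k] for row in prior) / n
        mean_current = sum(row[k] for row in current) / n
        ratios.append(mean_prior / mean_current)      # D_k
    # scale every current feature value by the ratio for that feature
    simulated = [[row[k] * ratios[k] for k in range(n_features)] for row in current]
    return ratios, simulated
```

By construction, the simulated set has the same per-feature mean as the prior set, while the values for an individual region generally differ from those of its real prior counterpart.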

The performances of the three ANNs were evaluated separately using testing datasets of 50 current and 50 prior images.
For each test region, the ANN generates a classification score ranging from 0 to 1, where the larger the score, the higher
the computed likelihood of being a true-positive mass region. The classification scores generated for all test regions were
used as input data to the ROCFIT program, which generates a receiver operating characteristic (ROC) curve and computes
the area under the ROC curve ($A_z$ value) [21]. We compared performance levels when using the three ANNs to classify
an independent set of suspicious mass regions as depicted on prior images.
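For readers without access to ROCFIT, the area under an ROC curve can be approximated directly from the classification scores. The sketch below computes the nonparametric (Mann-Whitney) area, which is a different estimator from the fitted $A_z$ value reported by ROCFIT and will generally not match it exactly; the function name `empirical_auc` is hypothetical.

```python
def empirical_auc(scores_positive, scores_negative):
    """Probability that a true-positive region outscores a false-positive one."""
    wins = 0.0
    for p in scores_positive:
        for n in scores_negative:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(scores_positive) * len(scores_negative))
```

A value of 0.5 corresponds to chance-level classification and 1.0 to perfect separation of the two groups of regions.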


3.  RESULTS 

Table 1 shows the averages of the five feature values in the two training datasets extracted from the current and prior
images. Using a paired chi-square test to examine the mean values of each of the five features between the 50 pairs of
training mass regions, a significant difference (p < 0.05) was found in the average value of each of the five features.
Table 2 summarizes the areas under the ROC curves ($A_z$ values) for all three ANNs during training and testing. Figure 1
shows the three ROC curves generated by applying the three ANNs to the prior testing dataset. ANN-1 yields the best
performance on the current testing dataset ($A_z$ = 0.781 ± 0.019) and the worst performance on the prior testing dataset
($A_z$ = 0.613 ± 0.026), as shown in Table 2. Both ANN-2 and ANN-3 yield significantly better performance than ANN-1 in
classifying mass regions on the prior testing dataset (p < 0.05). $A_z$ values increased by 10.6% (from 0.613 to 0.678)
with ANN-2 and by 8.8% (from 0.613 to 0.667) with ANN-3. The experimental results also demonstrated that there
was no significant performance difference between ANN-2 and ANN-3 on the prior testing dataset (p = 0.15).


Table 1: Average feature values and their difference ratios between 50 pairs of mass regions depicted on current and
prior images.

Feature                            F1       F2      F3     F4     F5
Average value (prior images)       78.60    34.90   0.24   0.78   4.25
Average value (current images)     122.67   42.68   0.21   0.83   5.07
Ratio                              0.70     0.82    1.14   0.94   0.84

Table 2: Areas under ROC curves ($A_z$ values) of three ANNs during training and testing.

Network   Training          Testing current images   Testing prior images
ANN-1     0.873 ± 0.016     0.781 ± 0.019            0.613 ± 0.026
ANN-2     0.761 ± 0.021     0.709 ± 0.026            0.678 ± 0.029
ANN-3     0.779 ± 0.019     0.736 ± 0.028            0.667 ± 0.029

Figure 1: ROC curves of testing results when applying three ANNs to the test dataset of prior images.


4.  DISCUSSION 

With improvements in diagnostic technologies and increases in screening compliance of the general population,
radiologists have to detect increasingly subtle abnormalities as depicted on mammograms. As a result, CAD
systems that currently provide satisfactory cueing results could face deterioration in performance over time due to a
general shift in the subtleness of abnormalities and the stage at detection. Feature-based machine learning classifiers,
such as ANNs, are widely used in the final stage of CAD schemes for identifying masses and microcalcification clusters.
Since these classifiers are trained to generate “global” functions that cover the entire instance space, CAD performance
depends heavily on the training databases [22]. This is true, in particular, in mammography, where the size and diversity
of training datasets are generally limited [14,15]. In principle, a single CAD scheme that achieves high sensitivity on
both “subtle” and relatively “easy” masses at an acceptable false-positive rate can be developed; in reality, however, it
is a very difficult task because image features are substantially different for suspicious mass regions extracted from the
current and prior images [16,17]. In order to improve CAD performance in detecting subtle masses at an earlier stage, the
schemes should be trained (or optimized) using databases involving a large fraction of subtle mass regions (e.g., new
cases that had been rated originally as negative and later proven to be positive) [17,18].

However, it is a very difficult and time-consuming task to collect a large number of diverse subtle cases (e.g., the
false-negative cases). This study demonstrated an alternative approach to collectively simulate such cases. By
systematically adjusting the feature values extracted from current images, we generated a set of simulated “prior” mass
regions. Our results demonstrated that (1) an ANN trained using simulated prior mass regions could achieve significantly
better performance in detecting the masses at an earlier stage than an ANN trained using current mass regions and (2)
there is no significant difference in performance between the ANNs trained using either real or simulated prior mass
regions. As a result, by estimating the change over time of some important features, one can adjust CAD performance
for better detection of masses at an earlier stage. Since this is a very preliminary study involving a limited database and a
small set of features, the concept needs to be further investigated. If this approach is validated with significantly larger
image databases and a larger number of features, it may provide a simple and efficient method to periodically update (or
re-optimize) CAD schemes.

ABSTRACT

Our goal was to develop an inexpensive, high-quality, multi-site telemammography system, implemented with low-level
data connections, that provided a communication link for an “almost real-time” response from a radiologist (central
site) to remote “underserved” sites. The remote sites digitize mammographic films using high-resolution laser
digitizers. Images are automatically cropped, compressed (wavelet-based), and encrypted prior to transmission. At the
central site, images are decrypted, decompressed, unsharp masked, and displayed using automatically determined LUTs.
The sites communicate instantly via a “chat box.” Remote sites 1, 2, and 3 are 15, 20, and 90 miles from the central
site, respectively, and connected by POTS (sites 1 and 2) and LAN (site 3). Only minimal noticeable differences at
compression levels of 50:1 and 75:1 could be identified unless images were magnified to extreme levels. Two experienced
observers rated the LUTs for 200 images as “acceptable” to “excellent.” Average cycle times to digitize, transmit, and
receive cases (four films each) at 75:1 compression were 5.97, 6.85, and 5.77 min/case from sites 1, 2, and 3, respectively.
Unique data-handling schemes significantly decrease the image file size and allow successful transmission in a reliable,
timely manner. Over 1000 cases have been transmitted to date. Messaging was found to be easy to use.

Keywords:  Teleradiology,  breast  cancer  screening,  image  decision  making,  mammography. 

1. INTRODUCTION

The benefits of breast cancer screening mammography of asymptomatic women have been extensively studied and
reported in the recent literature [1-6]. Mammographic screening will continue to be widely used worldwide, despite
periodic reports of limited or no benefits from such practices [7-9]. Management of mammographic screening in terms of
public perception and compliance [10-12], radiologist’s practice and performance [13-15], and personnel shortages [11,16]
could be improved in both rural and urban clinics. The use of teleradiology is one approach that could assist in this regard.

The high spatial resolution required by mammography necessitates the use of commercial digitizers and high-resolution
monitors to sufficiently preserve image quality [17]. Transmission time of large amounts of mammographic image data
(35-55 MBytes per image) is frequently dependent on the communication link. Low-level data connections (i.e., Plain
Old Telephone System (POTS)) may require data processing to decrease the image file size to enable transmission of
large amounts of data in a timely manner.

This manuscript presents a preliminary assessment of technical and operational issues regarding a multi-site
telemammography system using low-level data connections. This study is a continuation of an ongoing effort over the
past several years [18,19]. The system was designed on the concept of distributed acquisition/centralized review and to
facilitate communication between a radiologist at a central site and a technologist at a remote “underserved” site. For
the purpose of this project, “underserved” means a location where a physician is not physically present when the
screening examinations are conducted. The technical features described were designed and implemented using a low-cost
approach to transmit data across low-level data connections in a timely manner and maintain a high level of image
quality. Issues evaluated included: look-up table settings (window and level), image cropping, image compression,
transmission time, and workstation display features. We expect to demonstrate that the combination of efficient data
handling, intelligent image processing, and easy-to-use messaging can be implemented to produce an inexpensive,
high-quality telemammography system capable of an “almost real-time” response from the central site radiologist to the
remote site technologist.

* drescherjm@msx.upmc.edu; phone (412) 641-2563; fax (412) 641-2582, University of Pittsburgh, Magee-Womens
Hospital, 300 Halket Street, Suite 4200, Pittsburgh, PA 15213

Medical Imaging 2003: PACS and Integrated Medical Information Systems: Design and Evaluation,
H. K. Huang, Osman M. Ratib, Editors, Proceedings of SPIE Vol. 5033 (2003)
© 2003 SPIE · 1605-7422/03/$15.00

2.  METHODS 


2.1  Central  and  remote  sites 

The central site is staffed by experienced radiologists and located at Magee-Womens Hospital, Pittsburgh, PA, USA.
The telemammography workstation at the central site is powered by a dual 1.2 GHz multi-processor (Athlon MP,
Advanced Micro Devices, Sunnyvale, CA, USA) with 2 GB of RAM operating under Microsoft Windows 2000 Server
(Microsoft Corporation, Redmond, WA, USA). The workstation display consists of three high-resolution (2048 x 2560)
8-bit grayscale portrait monitors at a nominal setting of 80 ftL (DS5100P, Clinton Electronics, Rockford, IL, USA). For
data communication, the workstation uses 56K hardware modems (U.S. Robotics, Rolling Meadows, IL, USA) and
Ethernet network cards (OfficeConnect 10/100 NIC, 3COM, Santa Clara, CA, USA). A Kodak Dryview film printer
(Eastman Kodak, Rochester, NY, USA) is connected to the workstation for film printing as necessary (Fig. 1).

The remote sites are staffed by mammography technologists. The computer hardware at the remote sites operates under
Microsoft Windows 2000 Workstation powered by a 900 MHz processor (Athlon 900, Advanced Micro Devices,
Sunnyvale, CA, USA) with 512 MB of RAM. High-resolution laser film digitizers (Lumiscan 85, Eastman Kodak,
Rochester, NY, USA) are connected to the remote computers via SCSI interface and equipped with a film feeder
capable of holding six films as large as 10 x 12 inches. Mammographic films are digitized at 50 micron pixel
dimensions and 12-bit grayscale. The remote site computers also have 56K hardware modems and Ethernet network
cards (Integrated PRO/100 S Desktop Adapter, Intel Corporation, Santa Clara, CA, USA) for data communication.
Prior patient reports or history are transmitted along with the images by inserting them into an attached page scanner
(hp scanjet 5490C, Hewlett-Packard, Palo Alto, CA, USA). Sites 1 and 2 transmit data across Plain Old Telephone System
(POTS) lines and are located 15 and 20 miles from the central site, respectively (Fig. 1). Site 3 is 90 miles from the
central site and transmits data across a Local Area Network (LAN).

2.2  Software  Design 

The software architecture at the central and remote sites is a multithreading design that allows independent task
assignment with simultaneous response to user input. A message dispatch mechanism synchronizes bi-directional
communication between all the main threads, except for the Time Manager (Fig. 2). The Time Manager periodically
dispatches elapsed time messages to the other main threads without receiving messages. Each main thread acts only on
messages associated with its function and may spawn subordinate (worker) threads that share data objects to accomplish
tasks. A Reader/Writer lock, derived from Microsoft Windows synchronization primitives, prevents corruption of the
shared data. The Reader/Writer lock permits access to the shared data to any number of readers simultaneously.
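The behavior of such a lock can be illustrated with a minimal sketch. This is not the system's implementation, which is derived from Microsoft Windows synchronization primitives; it is a Python illustration built on `threading.Condition`, and it assumes the usual reader/writer semantics in which a writer waits for exclusive access. The class and method names are hypothetical.

```python
import threading

class ReaderWriterLock:
    """Many concurrent readers, or one writer with exclusive access."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0        # number of threads currently reading
        self._writing = False    # True while a writer holds the lock

    def acquire_read(self):
        with self._cond:
            while self._writing:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()   # wake any waiting writer

    def acquire_write(self):
        with self._cond:
            while self._writing or self._readers > 0:
                self._cond.wait()
            self._writing = True

    def release_write(self):
        with self._cond:
            self._writing = False
            self._cond.notify_all()       # wake waiting readers and writers
```

This simple variant favors readers, so a steady stream of readers could delay a writer; a production lock would typically add a fairness policy.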

Central site main threads:

Time manager - periodically indicates elapsed time.
Archive manager - manages disk space by loading images, saving images, managing cases, and deleting archived cases when disk space is limited.
Case manager - creates cases, assigns data, and performs database functions.
Display manager - displays images and forwards messages to the main application window.
Distribution manager - receives, transmits, and processes data.

Remote site main threads:

Digitization manager - manages film digitizing.
Case manager
Display manager
Distribution manager


Fig. 2. Main threads and intra-process communication. Time manager does not receive messages.

2.3 Image processing

The first step in the series of image processing procedures is designed to automatically crop each image to decrease
the non-tissue area surrounding the breast (Fig. 3). The automated cropping algorithm begins by sub-sampling the
image at an 8:1 ratio. The standard deviation (STD) of a 7 x 7 pixel mask is calculated at each sub-sampled pixel (STD
of the sub-sampled image). Next, a threshold is applied to the STD image to separate tissue and non-tissue regions,
where a high STD indicates tissue regions. A region growing algorithm based on 4-neighbor connectivity is used to
identify breast tissue as the largest region in the image. Finally, rudimentary logic is used to determine the cropping
parameters based on the orientation of the tissue regions; these parameters are then applied to the original image.
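The cropping steps above can be sketched as follows. This is a dependency-free illustration rather than the system's code: the image is assumed to be a list of rows of pixel values, the threshold value is a placeholder, and a plain bounding box of the largest region stands in for the orientation-based logic, which the text does not specify. The function name `find_crop_box` is hypothetical.

```python
from collections import deque
from statistics import pstdev

def find_crop_box(image, step=8, window=7, std_threshold=5.0):
    """Return (top, left, bottom, right) around the largest high-STD region."""
    height, width = len(image), len(image[0])
    small = [row[::step] for row in image[::step]]          # sub-sample at step:1
    h, w = len(small), len(small[0])
    half = window // 2

    # local standard deviation, thresholded into a tissue mask
    tissue = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            values = [small[yy][xx]
                      for yy in range(max(0, y - half), min(h, y + half + 1))
                      for xx in range(max(0, x - half), min(w, x + half + 1))]
            tissue[y][x] = pstdev(values) > std_threshold

    # region growing with 4-neighbor connectivity; keep the largest region
    seen = [[False] * w for _ in range(h)]
    largest = []
    for y in range(h):
        for x in range(w):
            if not tissue[y][x] or seen[y][x]:
                continue
            region, queue = [], deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                region.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and tissue[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(region) > len(largest):
                largest = region

    if not largest:
        return 0, 0, height, width                          # nothing found: keep full image
    rows = [p[0] for p in largest]
    cols = [p[1] for p in largest]
    # map the bounding box back to original-image coordinates
    return (min(rows) * step, min(cols) * step,
            min(height, (max(rows) + 1) * step), min(width, (max(cols) + 1) * step))
```

The returned box is expressed in original-image coordinates, so the crop can be applied to the full-resolution data.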

Following image cropping, the image data are compressed using the irreversible (lossy), 9/7 transform, wavelet-based
JPEG 2000 method. Prior to transmission from the remote sites, the data packets are encrypted using strong 128-bit
Microsoft Point-to-Point Encryption (MPPE) with Microsoft Challenge Handshake Authentication Protocol (CHAP)
version 2 authentication. The first steps at the central site are decryption and decompression of the image data.

Image display on the workstation monitors at the central site is enhanced by minimal unsharp masking of the
decompressed image data prior to display. To begin unsharp masking, the image data are first smoothed with a 2-D 129
mean kernel. The weighted (0.10) smoothed image is subtracted from the decompressed image. The resulting pixel
values of the image data are then re-scaled to the range 0 to 4095.
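The unsharp masking steps can be sketched as below. The sketch assumes the image is a list of rows of pixel values and that the mean kernel is truncated at the image borders; the kernel size is left as a parameter with a small placeholder default, and the function name `unsharp_mask` is hypothetical.

```python
def unsharp_mask(image, kernel_size=5, weight=0.10, out_max=4095):
    """Subtract a weighted mean-smoothed copy, then rescale to 0..out_max."""
    h, w = len(image), len(image[0])
    half = kernel_size // 2
    result = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            rows = range(max(0, y - half), min(h, y + half + 1))
            cols = range(max(0, x - half), min(w, x + half + 1))
            # mean of the (border-truncated) neighborhood
            mean = sum(image[yy][xx] for yy in rows for xx in cols) / (len(rows) * len(cols))
            result[y][x] = image[y][x] - weight * mean
    lo = min(min(row) for row in result)
    hi = max(max(row) for row in result)
    if hi == lo:                       # flat image: nothing to rescale
        return [[0] * w for _ in range(h)]
    scale = out_max / (hi - lo)
    return [[round((v - lo) * scale) for v in row] for row in result]
```

The nested loops keep the sketch readable; a practical implementation would use a running-sum or separable filter for large kernels.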


To  minimize  the  need  for  manual  adjustment  during  image  viewing,  default  look-up  table  (LUT)  values  are 
automatically  calculated  based  on  the  pixel  value  distribution  (histogram).  The  typical  pixel  value  distribution  is 
bimodal.  The  window  value  (contrast)  is  set  as  the  span  of  the  two  modes,  and  the  level  value  (brightness)  is  set  as  the 
center  between  the  two  modes.  The  final  stage  of  the  image  processing  prior  to  image  display  is  to  pad  (fill)  the  images 
to  restore  the  full  height  of  the  image,  but  not  the  full  width  (Fig.  3). 
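A minimal sketch of the default window and level calculation follows. How the two histogram modes are located is not described in the text; the sketch assumes, purely for illustration, that one mode lies in the lower half and the other in the upper half of the pixel value range. The function name `auto_window_level` and the bin count are hypothetical.

```python
def auto_window_level(pixels, max_value=4095, n_bins=64):
    """Return (window, level) from the two modes of a bimodal histogram."""
    bin_width = (max_value + 1) / n_bins
    hist = [0] * n_bins
    for v in pixels:
        hist[min(n_bins - 1, int(v / bin_width))] += 1
    split = n_bins // 2
    # assumed mode search: tallest bin in each half of the value range
    low_bin = max(range(split), key=hist.__getitem__)
    high_bin = max(range(split, n_bins), key=hist.__getitem__)
    low_mode = (low_bin + 0.5) * bin_width
    high_mode = (high_bin + 0.5) * bin_width
    window = high_mode - low_mode              # contrast: span of the two modes
    level = (high_mode + low_mode) / 2.0       # brightness: center between the modes
    return window, level
```

For example, pixel values clustered near 100 and near 3000 give a window of roughly 2900 and a level of roughly 1500.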


Fig.  3.  Data  flow  of  the  telemammography  system  illustrating  the  order  of  the  image  processing  tasks  and  where  (remote  or 
central)  the  process  is  performed. 


2.4  Workstation  display  functions  and  features 

To  allow  user-specific  preferences  to  be  used  during  case  review,  display  options  on  the  workstation  are  flexible  with 
all  features  being  mouse-driven.  The  default  display  is  left  and  right  craniocaudal  views  (LCC  &  RCC)  on  the  left 
monitor,  and  left  and  right  mediolateral  oblique  views  (LMLO  &  RMLO)  on  the  center  monitor  to  be  similar  to  our 
conventional  clinical  film  presentation  (Fig.  4).  However,  a  large  number  of  display  options  are  available  to  users.  If  a 
film  is  digitized  in  an  incorrect  orientation,  the  user  has  the  ability  to  flip  images  (top  to  bottom  or  left  to  right)  and 
rotate  images  180  degrees.  Communication  from  the  remote  site  is  displayed  on  the  right  monitor  (Fig.  4). 

Two forms of image magnification are available on the workstation display. Typically, the normal display scale with a
single image per monitor is approximately 100 micron pixel dimensions, and with two images per monitor it is
approximately 200 micron pixel dimensions. A scrollable image magnification box provides a true 1:1 presentation
(monitor pixel:digitized pixel), resulting in 50 micron pixel dimensions. The size of the box is 511 x 566
pixels for one image/monitor, 408 x 566 pixels for two images/monitor, and 204 x 266 pixels for four
images/monitor. It is also possible to pan across the image quadrant-by-quadrant.

Fig. 4. Telemammography workstation at the central site pictured in the default image display format.

The automated LUT values can be manually adjusted according to the observer’s preference. The window and level values
are determined based on the mouse position (movement), and the image display is instantly updated as the mouse is
moved. Once the desired values are determined, these can be applied to the individual image or all images associated
with the case. The LUT values can be reset to the automated (default) values at any time during viewing.

2.5  Inter-site  communication 

To facilitate effective communication between the technologists (remote site) and radiologists (central site), a “chat
box” type messaging function was implemented. The “chat message” can be sent with each case and it provides a
real-time, interactive communication tool between the sites. During the initial phase of evaluating the system,
communication is performed in one cycle. The technologist sends a chat message with each case, and the radiologist
responds directly to the message. The chat boxes on both sides contain four general areas: (1) patient demographics,
(2) message display area, (3) pull-down menus, and (4) free text area (Fig. 5). There are five pull-down menus on the
technologist chat box to focus communication on possible actionable items. These indicate: (1) breast: left or right; (2)
view: craniocaudal and/or mediolateral oblique; (3) finding: mass or calcifications; (4) comparison with prior exam:
baseline, new, or change in findings; and (5) possible additional procedure needed: additional views and/or ultrasound.
The radiologist can reply after reviewing each case. The response options are: (1) do recommended procedure as
suggested; (2) no additional procedures necessary; and (3) do not do the procedure recommended, but do X, Y, and Z.

2.6  Technical  and  operational  evaluation 

In the preliminary technical assessment phase, three processes of the telemammography system were evaluated. First, to
assess the user’s acceptance of the automated LUT values for image review without the need to adjust display
parameters, 50 cases (200 images) sent from all sites were subjectively rated by two experienced observers on a scale of
1 to 4. The experiment was designed to assess acceptability of default values for the purpose of reviewing each case
and determining the need (or not) for additional procedures. In all of our studies we evaluated the system under normal
operating conditions. As a result, intra- and inter-site measured variability reflect what could be expected in an
“on-line” clinical operation. Second, the implementation of high-level image compression in mammographic imaging was
evaluated during subjective Just Noticeable Difference (JND) studies. The studies compared images at no compression,

3. RESULTS

The evaluation of the technical and operational processes was favorable in all areas. The automated LUT settings, the
image cropping, the high level of image compression, and the cycle time to transmit and receive cases were all
acceptable for implementation of the telemammography system for the designed purpose. The initial impressions of the
inter-site communication, “chat messaging,” indicate that it can facilitate effective communication between the
technologist at remote sites and the radiologist at the central site. Although the technical issues with regard to scanning
and transmitting patient reports with each case have been resolved, the practice has not been implemented to date.

The  automatically  calculated  LUT  settings  were  reported  as  “acceptable”  to  “excellent”  by  two  experienced 
mammography  researchers.  On  a  scale  of  1  to  4  (1  =  unusable,  2  =  need  minor  adjustments,  3  =  acceptable,  and  4  = 
excellent),  the  two  observers  had  mean  ratings  for  200  automatically  computed  LUT  settings  of  2.64  (STD  =  0.57)  and 
3.51  (STD  =  0.53).  After  minor  adjustments  were  made  as  the  result  of  the  above  experiment,  all  observers  including 
clinicians  using  the  workstation  to  test  different  aspects  of  the  system  accepted  automatically  set  values  in  over  90%  of 
cases.  Consequently,  window  and/or  level  manipulations  are  being  performed  in  less  than  10%  of  cases  during 
retrospective  and  simulated  prospective  case  reviews. 

For review of non-magnified or moderately magnified images, 50:1 and 75:1 data compression levels were comparable
and acceptable when evaluated on either laser-printed films or the telemammography workstation. Subjective JND
studies were conducted using laser-printed films as well as images displayed side-by-side on workstation monitors. The
studies indicated that at extreme magnifications, differences were detected, but did not necessarily result in degradation
of perceived diagnostic quality. For example, the “visibility” and “clarity” of microcalcifications in the digital images
were judged as “almost equivalent” between the full-scale, non-compressed images and images compressed at a 75:1
ratio (Figs. 6 and 8). Comparable results were obtained with magnification (Figs. 7 and 9). The automated cropping
did not remove breast tissue in any of our cases to date, and it produces “aesthetically pleasing” images.

The time to transmit and receive four films (8 x 10 inches each) was reliably less than 7 minutes/case for each site using
75:1 data compression (Table 1). The combination of image cropping and 75:1 data compression ratio decreased image
file size to allow cycle times that were adequate for implementation of the telemammography concept and met our
planned technical specifications. Sites 1 and 2 were connected via 56K modems that dialed a four-digit telephone
number (i.e., connected via an in-house telephone line) and a ten-digit telephone number (i.e., connected via an outside
telephone line), respectively. Consistent bandwidths of sites 1 and 2 were approximately 33 Kbits/second and 21
Kbits/second, respectively. The digitization process (approximately 50 seconds/film) was the limiting factor at site 3,
which was connected via LAN. Site 2 had communication problems (decreased bandwidth) during the first
measurement that have been largely resolved.
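The relationship between file size, compression, and link bandwidth can be illustrated with a rough estimate. The sketch below makes several assumptions that are not stated in the text: 12-bit pixels stored in 2 bytes, 1 Kbit = 1000 bits, no protocol overhead, and a `kept_fraction` parameter standing in for the unreported fraction of the image that survives cropping. The function name is hypothetical, and the result is not expected to reproduce the measured cycle times.

```python
def estimate_case_transfer_seconds(film_inches=(8, 10), pixel_um=50, bytes_per_pixel=2,
                                   films_per_case=4, compression_ratio=75,
                                   kept_fraction=1.0, link_kbit_per_s=33):
    """Rough transmission time for one case, ignoring digitization and overhead."""
    mm_per_inch = 25.4
    px_w = film_inches[0] * mm_per_inch * 1000 / pixel_um    # pixels across
    px_h = film_inches[1] * mm_per_inch * 1000 / pixel_um    # pixels down
    raw_bytes = px_w * px_h * bytes_per_pixel * films_per_case
    sent_bits = raw_bytes * kept_fraction / compression_ratio * 8
    return sent_bits / (link_kbit_per_s * 1000)
```

Under these assumptions an uncropped 8 x 10 inch film is about 41 MB, and a four-film case at 75:1 compression over a 33 Kbits/second link takes roughly nine minutes to send; setting `kept_fraction` below 1 shows how cropping shortens the transfer proportionally.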

TABLE 1

Experimentally Measured Average Cycle Time for Digitizing, Transmitting and Receiving a Case with 4
Films (8 x 10 inches each)

Image format                                        Site 1 - POTS*   Site 2 - POTS   Site 3 - LAN
                                                    (min/case)       (min/case)      (min/case)
50:1 compression, not cropped, and not encrypted    13.22            24.42           5.38
50:1 compression, cropped, and encrypted            6.47             13.13           5.65
75:1 compression, cropped, and encrypted            5.97             6.85            5.77

*in-house POTS


4.  DISCUSSION 

The  “proof  of  concept”  to  design  an  inexpensive,  high-quality,  multi-site  telemammography  system  implemented  with 
low-level  data  connections  has  been  established  to  facilitate  the  concept  of  “almost  real-time”  distributed 
acquisition/centralized  review.  The  technical  feasibility  of  the  concept  was  demonstrated  by:  (1)  the  digitization  of 
films  acquired  during  clinical  breast  cancer  screening  mammography;  (2)  the  timely  transmission  of  the  digitized 
images  across  low-level  data  connections  (less  than  7  minutes/case);  and  (3)  the  efficient  archiving,  retrieving,  and 
viewing  of  image  data  at  the  central  site.  The  short  cycle  time  of  the  system  was  realized  because  of  the  image  file  size 
reduction  due  to  automated  image  cropping  and  image  data  compression  and  the  efficient  multi-tasking  software 
approach  based  on  a  synchronized  multi-threading  design.  Image  processing  methods  were  fundamental  to  the  success 
of  the  telemammography  system.  The  automated  cropping  and  compression  produced  images  without  a  significant 
degradation of the diagnostic image quality, and the images were well received by the radiologists. Although the
automated window and level calculations were found to be acceptable, radiologists manually adjusted the window and
level settings in approximately ten percent of cases during individual case review. The high-resolution image display of the
telemammography  workstation  was  rated  acceptable  for  reviewing  screening  mammographic  images  for  the  purpose  of 
determining  the  need  for  additional  procedures. 

To  date,  over  1000  screening  exams  have  been  successfully  transmitted  using  the  telemammography  system.  The 
preliminary  results  suggest  that  the  telemammography  system  could  accomplish  the  goals  to  increase  effective 
communication  between  remote  “underserved”  sites  and  the  central  location,  and  permit  experienced  radiologists  to 
remotely  monitor  and  facilitate  some  decision  making  while  the  patient  remains  in  the  clinic.  The  addition  of  two  key 
components  to  the  telemammography  system  should  improve  the  system’s  capability  and  effective  utilization.  First, 
scanned  prior  patient  reports  will  be  added  to  the  information  transmitted  with  each  case.  Second,  Computer  Aided 
Detection (CAD) schemes will be incorporated into the system and the results will be displayed at the central site.

Fig. 6. Original left mediolateral oblique image of patient #569.
Image is not cropped, compressed, or unsharp masked.

Mammography with
Computer-aided  Detection: 
Reproducibility  Assessment — 
Initial  Experience1 

PURPOSE:  To  examine  the  performance  and  reproducibility  of  a  commercially 
available  computer-aided  detection  (CAD)  system  with  a  set  of  mammograms 
obtained  in  100  patients  who  had  undergone  biopsy  after  positive  findings  at 
mammography. 

MATERIALS AND METHODS: One hundred positive mammographic examinations (four views each),
depicting 96 masses and 50 microcalcification clusters, were scanned and analyzed three times
by the CAD system. Reproducibility of detection sensitivity and the individual CAD-generated
cues in the three images were examined. Both abnormality- and region-based detection
sensitivities were compared.

RESULTS: Forty-eight (96.0%) of 50 microcalcification clusters were marked on all
three images in the abnormality-based analysis. Of the remaining two clusters, one
was marked in two images and one was marked in only one. The abnormality-based
sensitivity for mass detection ranged from 66.7% (64 of 96) to 70.8% (68 of 96).
The system generated identical patterns (including images with and those without
cues) for all three images in 53.3% (213 of 400) of images. For true-positive cluster
regions, 88.9% (80 of 90) were marked at the same location in all images. For
true-positive mass regions, 69.5% (82 of 118) were marked at the same locations in
all images. In false-positive detections, only 44.0% (81 of 184) of false-positive mass
regions and 31.9% (38 of 119) of false-positive cluster regions were marked at the
same locations on all three images.

CONCLUSION: Reproducibility of marked regions generated by the CAD system is
improved from that reported previously, largely as a result of the substantial reduction
in the false-positive detection rates. Reproducibility of true-positive identification
of masses remains an important issue that may have methodologic and clinical
practice implications.

Mammography is a common and effective method with which to screen for early detection
of breast cancer. Interpreting mammograms, and particularly identifying subtle masses
and microcalcification clusters surrounded by complex breast tissue patterns, is a
difficult and time-consuming task. Findings in studies show that 10% to 30% of
breast cancers that are visible on mammograms during retrospective readings are missed
during the original interpretations for various reasons (1-3). One well-documented
method to reduce false-negative rates in mammography is the use of an independent
double-reading approach (4,5). However, this approach is both inefficient and costly. As a
result, after intensive research and substantial improvements in the past 2 decades,
computer-aided detection (CAD) systems have been developed to provide radiologists with
a "second opinion" when they identify suspicious regions for masses or microcalcification
clusters. In the current study, we used one of three commercially available CAD systems
that have been approved by the U.S. Food and Drug Administration and are used for this
purpose.

Because of the potential importance of CAD systems in the clinical environment, several
studies (6-10) have been conducted recently to evaluate the performance of CAD systems
alone and their possible effect on diagnostic performance of radiologists under a variety
of clinical conditions. In one recent study involving 12,860 patients in a community breast
center, use of CAD resulted in a 19.5% increase in the number of cancers detected without
undue effect on the recall rate (from 6.5% to 7.7%) (6). In another large retrospective
study, a false-negative rate of 21% was found when 14 radiologists interpreted mammograms,
and the CAD system correctly marked 77% of these missed cases (7). Thus, researchers claim
that CAD cueing could potentially reduce this false-negative rate by as much as 77%
without an increase in the recall rate (8). On the other hand, findings in a different
study showed that despite high (and clinically viable) sensitivity, the CAD system had no
effect on radiologist performance (including sensitivity and specificity) (9). These
researchers suggested that perhaps the many false-positive markings influenced the
radiologists not to have sufficient confidence in the CAD results to alter their original
interpretations (9). Results in another retrospective study demonstrated that the
performance of a CAD system could affect the performance of radiologists in the detection
of masses and microcalcification clusters. Highly performing CAD schemes with high
sensitivity and a low false-positive rate could improve radiologists' performance
significantly, while poorly performing CAD schemes could significantly (P < .01) decrease
readers' performance (10).

An important issue related to the use of CAD is the reproducibility of results. In one
study, an early version of ImageChecker (R2 Technology, Los Altos, Calif) was evaluated,
and the authors suggested that its reproducibility may be insufficient for the routine
clinical environment (11). Recently, a new version of the software was used, which improves
the detection sensitivity and specificity (12). In the version used in the current study
(ImageChecker, version 2.0), the stated detection sensitivity for the cancer cases was
increased from 83.7% to 90.4% (including an increase in mass detection from 74.7% to 85.7%
and an essentially unchanged performance for microcalcification detection of more than
98%). At the same time, the false-positive rate was reduced substantially from
approximately 1.0 per image to 0.5 per image (or from 4.1 to 2.06 false-positive cues per
four views in true-negative cases) (12). The purpose of our study was to examine the
performance and reproducibility of a commercially available CAD system by using a set of
mammograms acquired in 100 patients who had undergone biopsy after positive findings at
mammography.

MATERIALS  AND  METHODS 

Cases 

During the past several years, a large database (>1,000 cases) of digitized images and
associated diagnostic results has been established and managed in our laboratory under an
approved institutional review board protocol (informed consent was waived). For the purpose
of third-party case selection, we asked a staff member not otherwise related to this
current investigation to randomly select 100 mammographic cases (four views each) from the
biopsy records of our institution during the years 1999-2001. We requested that 25 of the
cases depict microcalcification clusters and 75 cases depict masses as a primary detection
finding. At least two-thirds of the cases were to be selected from those proven to be
malignant. With the exception of these conditions, cases were selected solely by the staff
member from the biopsy records. The selection process did not involve a previous review of
any of the images. Therefore, there was no preselection (and potentially biasing) process
as related to the average tissue density or the subtlety of the abnormalities depicted in
the images.

Each case could involve one or more abnormalities (mass, microcalcification cluster, or
both). In these 100 cases, 51 depicted only masses (43 depicted one mass and eight depicted
two masses), 12 depicted only microcalcification clusters (11 depicted one cluster and one
depicted two clusters), and 37 depicted both masses and clusters (one mass and one
cluster). There were no cases with more than three abnormalities depicted. The data set
involved 96 verified masses and 50 verified microcalcification clusters. Sixty-five of the
96 masses were malignant, and 31 were benign. Thirty-one of the 50 microcalcification
clusters were associated with malignancy, and 19 were benign. By examining all source
documents (including pathology reports), the locations of all abnormalities were specified
by radiologists.
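
The case composition above can be checked with simple arithmetic. The following is a minimal sketch that only re-adds the counts stated in the text; the variable names are illustrative and are not part of the study.

```python
# Case composition as reported in the text:
# (number of cases, masses per case, clusters per case)
case_groups = [
    (43, 1, 0),  # cases depicting one mass only
    (8, 2, 0),   # cases depicting two masses only
    (11, 0, 1),  # cases depicting one cluster only
    (1, 0, 2),   # cases depicting two clusters only
    (37, 1, 1),  # cases depicting one mass and one cluster
]

total_cases = sum(n for n, _, _ in case_groups)
total_masses = sum(n * masses for n, masses, _ in case_groups)
total_clusters = sum(n * clusters for n, _, clusters in case_groups)

print(total_cases, total_masses, total_clusters)  # 100 96 50
```

The totals agree with the 100 cases, 96 verified masses, and 50 verified microcalcification clusters reported for the data set.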

CAD  Evaluation 

These 400 images were scanned through the CAD system three times within a period of 3
weeks. After digitization and computation, suspicious masses and microcalcification
clusters identified by the CAD system were marked on the output paper images by using the
standard identification scheme. The CAD system does not outline the entire mass region or
individual microcalcifications in a cluster; only a small star or a triangle is
superimposed on the image to indicate the presence of a suspicious region for a mass or a
cluster, respectively. The boundaries of masses and clusters were identified visually on
the images by a researcher (B.Z.), who consulted with radiologists in cases of ambiguity.
If the star was located anywhere inside a true-positive mass region in the image, this mass
was considered to be identified correctly by the CAD system. Similarly, as long as a
triangle was overlapping any of the microcalcification areas, the mark was considered to
represent a true-positive detection. Otherwise, the cue was considered to identify a
false-positive region. The processing of each case resulted in three sets of output images.
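
The scoring rule above can be sketched in code. In the study the boundaries were drawn visually on paper output, so the rectangular regions, the point-like cues, and all names below are assumptions made only for illustration; in particular, treating a triangle as a single point simplifies the "overlapping" criterion used for clusters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical verified abnormality, approximated by a bounding box."""
    kind: str  # "mass" or "cluster"
    x_min: int
    y_min: int
    x_max: int
    y_max: int

    def contains(self, x, y):
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max


@dataclass(frozen=True)
class Cue:
    """Hypothetical CAD mark: a star ("mass") or a triangle ("cluster")."""
    kind: str
    x: int
    y: int


def classify_cue(cue, true_regions):
    """Return "TP" if the cue lies inside a verified region of the same kind, else "FP"."""
    for region in true_regions:
        if region.kind == cue.kind and region.contains(cue.x, cue.y):
            return "TP"
    return "FP"


regions = [Region("mass", 10, 10, 50, 50)]
print(classify_cue(Cue("mass", 20, 20), regions))     # TP
print(classify_cue(Cue("cluster", 20, 20), regions))  # FP
print(classify_cue(Cue("mass", 60, 60), regions))     # FP
```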

Data  Analysis 

The sensitivity, false-positive rate, and reproducibility of the CAD system with these 100
cases (or 400 images) were analyzed for abnormality- and region-based values. In the
abnormality-based analysis, the sensitivity is assessed on the basis of the correct marking
of at least one true-positive region in either view (craniocaudal, mediolateral oblique, or
both), which included 96 masses (65 malignant) and 50 calcifications (31 malignant) in the
100 cases. In cases with more than one abnormality, each was considered to be independent
of the others. In the region-based analysis, the abnormality depicted in each view (either
craniocaudal or mediolateral oblique) was considered an independent true-positive finding.
Sensitivity was then computed on the basis of the number of correctly detected true-positive
regions (rather than abnormalities). This approach included 292 positive findings, namely,
96 masses and 50 clusters, each visible on two views. To compare the differences in
proportions of correctly detected abnormalities among replicated images, the pairwise
McNemar test was applied to the data set.
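
The difference between the two sensitivity definitions can be illustrated with a short sketch. The data structure (one dictionary of per-view detection flags per abnormality) and the toy values are assumptions for illustration only and are not study data.

```python
def sensitivities(abnormalities):
    """Compute (abnormality-based, region-based) sensitivity.

    abnormalities: list of dicts mapping a view name ("CC", "MLO") to True
    when the CAD system marked the abnormality correctly in that view.
    """
    n_abnormalities = len(abnormalities)
    n_regions = sum(len(views) for views in abnormalities)
    # Abnormality-based: a hit in at least one view counts once.
    abnormality_hits = sum(any(views.values()) for views in abnormalities)
    # Region-based: every view is an independent true-positive finding.
    region_hits = sum(sum(views.values()) for views in abnormalities)
    return abnormality_hits / n_abnormalities, region_hits / n_regions


# Toy example with four abnormalities, each visible on two views.
toy = [
    {"CC": True, "MLO": True},
    {"CC": True, "MLO": False},
    {"CC": False, "MLO": False},
    {"CC": False, "MLO": True},
]
abnormality_based, region_based = sensitivities(toy)
print(abnormality_based, region_based)  # 0.75 0.5
```

Because a single correct mark in either view is sufficient in the abnormality-based analysis, that value can never be lower than the region-based value for the same data.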

RESULTS 


Tables 1 and 2 summarize the performance of the CAD system with respect to mass and
microcalcification cluster detection in each of the three scans. Abnormality-based
sensitivity for mass detection ranged from 66.7% (64 of 96) to 70.8% (68 of 96). Although
scan 2 yielded the highest sensitivity for mass detection (68 of 96), scan 1 yielded the
highest number of detected malignant masses (47 of 65). For microcalcification cluster
detection, 48 of 50 clusters were detected by the CAD system on all three images. Two
malignant clusters were missed in scan 1, and one of these clusters was also missed in
scan 2. With the pairwise McNemar test, no significant (P > .3) differences were found in
the detection results between any pair of the three scans.

TABLE 1

Mass Detection Performance of CAD System during Each Scan

          Sensitivity (all cases)                    Sensitivity (malignant cases only)
Scan No.  Abnormality Based (%)  Region Based (%)    Abnormality Based (%)  Region Based (%)    False-Positive Rate per Image
1         69.8 (67 of 96)        52.1 (100 of 192)   72.3 (47 of 65)        54.6 (71 of 130)    0.33 (130 of 400)
2         70.8 (68 of 96)        52.6 (101 of 192)   70.8 (46 of 65)        52.3 (68 of 130)    0.33 (131 of 400)
3         66.7 (64 of 96)        51.0 (98 of 192)    69.2 (45 of 65)        51.5 (67 of 130)    0.31 (125 of 400)

TABLE 2

Microcalcification Cluster Detection Performance of CAD System during Each Scan

          Sensitivity (all cases)                    Sensitivity (malignant cases only)
Scan No.  Abnormality Based (%)  Region Based (%)    Abnormality Based (%)  Region Based (%)    False-Positive Rate per Image
1         96.0 (48 of 50)        85.0 (85 of 100)    93.5 (29 of 31)        85.5 (53 of 62)     0.17 (69 of 400)
2         98.0 (49 of 50)        87.0 (87 of 100)    96.8 (30 of 31)        87.1 (54 of 62)     0.19 (77 of 400)
3         100 (50 of 50)         86.0 (86 of 100)    100 (31 of 31)         87.1 (54 of 62)     0.20 (79 of 400)

TABLE 3

Number of Times a Mass (or a Region) was Detected

No. of Times   True-Positive   Malignant    True-Positive   Malignant      False-Positive   Total Marked
Detected       Masses          Masses       Mass Regions    Mass Regions   Mass Regions     Mass Regions
3              58 (77.3)       41 (78.8)    82 (69.5)       58 (71.6)      81 (44.0)        163 (54.0)
2              8 (10.7)        4 (7.7)      17 (14.4)       8 (9.9)        40 (21.7)        57 (18.9)
1              9 (12.0)        7 (13.5)     19 (16.1)       14 (17.3)      63 (34.3)        82 (27.1)
Total          75              52           118             81             184              302

Note: Numbers in parentheses are percentages.

TABLE 4

Number of Times a Microcalcification Cluster (or a Region) was Detected

No. of Times   True-Positive   Malignant    True-Positive     Malignant         False-Positive    Total Marked
Detected       Clusters        Clusters     Cluster Regions   Cluster Regions   Cluster Regions   Cluster Regions
3              48 (96.0)       29 (93.5)    80 (88.9)         50 (89.3)         38 (31.9)         118 (56.5)
2              1 (2.0)         1 (3.2)      8 (8.9)           5 (8.9)           30 (25.2)         38 (18.2)
1              1 (2.0)         1 (3.2)      2 (2.2)           1 (1.8)           51 (42.9)         53 (25.3)
Total          50              31           90                56                119               209

Note: Numbers in parentheses are percentages.


For region-based sensitivity, mass detection ranged from 51.0% (98 of 192) to 52.6% (101
of 192). The total number of mass regions detected ranged from 98 to 101 in each of the
three scans. However, the actual difference in the individual mass regions detected was
larger. For example, scan 1 depicted 100 regions and scan 2 depicted 101 regions. However,
only 88 of these regions were detected in both scans. For the detection of
microcalcification clusters, the region-based sensitivity ranged from 85.0% (85 of 100) to
87.0% (87 of 100) for individual cluster regions and from 85.5% (53 of 62) to 87.1% (54 of
62) for malignant clusters.
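
The pairwise McNemar test depends only on the discordant findings, that is, those detected in one scan but not in the other. The paper does not report the discordant counts used for its abnormality-based comparison, so the sketch below is only an illustration: it applies an exact (binomial) form of the test to the region-based mass counts quoted above for scans 1 and 2. The function name and the choice of the exact variant are assumptions.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar P value from the two discordant counts."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Probability of a split at least this uneven under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Region-based mass detection, scan 1 versus scan 2 (counts from the text).
detected_scan1, detected_scan2, detected_both = 100, 101, 88
only_scan1 = detected_scan1 - detected_both  # 12 regions marked in scan 1 only
only_scan2 = detected_scan2 - detected_both  # 13 regions marked in scan 2 only

p_value = mcnemar_exact(only_scan1, only_scan2)
print(only_scan1, only_scan2, p_value)  # 12 13 1.0
```

The example shows why nearly identical totals can hide a sizable disagreement: 25 mass regions were marked in only one of the two scans, yet the discordant counts are balanced, so the test finds no difference in the detection proportions.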

Although Tables 1 and 2 show that the total number of regions detected in this set of
images is relatively constant with all three scans, the locations of the regions detected
(in particular, false-positive regions) could differ from scan to scan. In 213 of 400
images, the output results for all three scans were identical, which represents an overall
reproducibility of 53.3%. Among these images, 37.6% (80 of 213) had no cues (including
neither true-positive nor false-positive cues) in all three scans. For the remaining 320
images, the CAD system marked 511 regions (1.6 cues per image) in three scans (including
true-positive cues). Of these 511 marked regions, 281 were identified on all three scans
(55% region-based reproducibility).
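
The reproducibility figures in this paragraph follow directly from the reported counts. The sketch below simply recomputes them; the variable names are illustrative. It also derives the image-based reproducibility among images that received at least one cue, which follows from the same counts.

```python
# Counts reported in the text above.
images_total = 400
images_identical = 213      # identical output in all three scans
images_no_cues = 80         # identical because no cue was generated in any scan
regions_marked = 511        # regions marked in at least one scan
regions_in_all_three = 281  # regions marked at the same location in all scans

images_with_cues = images_total - images_no_cues         # 320
identical_with_cues = images_identical - images_no_cues  # 133

image_based = images_identical / images_total
image_based_cued_only = identical_with_cues / images_with_cues
cues_per_image = regions_marked / images_with_cues
region_based = regions_in_all_three / regions_marked

print(f"image-based reproducibility, all images:  {image_based:.4f}")
print(f"image-based reproducibility, cued images: {image_based_cued_only:.4f}")
print(f"marked regions per cued image:            {cues_per_image:.2f}")
print(f"region-based reproducibility:             {region_based:.4f}")
```

A large share of the identical images are identical only because no cue was produced, which is why the value for cued images is noticeably lower than the overall value.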

Tables 3 and 4 summarize the number of true-positive and false-positive masses and
microcalcification clusters (including both abnormalities and regions) that were identified
in all three scans, two scans, or only one scan. The results show that the reproducibility
for the true-positive regions (those identified in all three scans) is substantially higher
than that for the false-positive regions. For the true-positive mass regions, the CAD system
generated 118 cues in three scans, and 82 (69.5%) of them were marked at the same
locations. For the true-positive cued cluster regions, 88.9% (80 of 90) of cues were in the
same locations for all three scans. On the other hand, the reproducibility of the
false-positive cues was much lower, with a higher fraction of different cues being
generated in each scan. Only 44.0% (81 of 184) of the false-positive mass regions and 31.9%
(38 of 119) of the false-positive microcalcification cluster regions were marked at the
same locations in all three scans.

DISCUSSION


In a previous study, 38.5% (77 of 200) of images had CAD cues that were located
congruently in all three scans (11). In the current study, the CAD system generated
identical results on 53.3% (213 of 400) of the images. The improvement in reproducibility
may be largely a result of the substantial decrease in the false-positive detection rate
(from approximately 1.0 to 0.5 per image) (12). When we exclude 80 images that had no CAD
cues, the reproducibility in the remaining 320 images was reduced to 41.6% (133 of 320).
However, the reproducibility in detecting specific true-positive masses and
microcalcification clusters is perhaps more important than the more general image-based
reproducibility. It is generally difficult to directly compare the detection performance in
two experiments, because different image databases were used and the results depend heavily
on the difficulty of the selected cases (13). However, some comparative information can be
ascertained. In a previous report, the CAD system performed better for mass detection
(86.9% abnormality-based sensitivity) than for microcalcification cluster detection (76.6%)
(11), while in the current study, sensitivity for the detection of microcalcification
clusters was 96% or higher, and the sensitivity for mass detection was in the range of 70%.
These results may indicate that the microcalcification clusters depicted in our data set
were easier to detect, and masses depicted in our database were more subtle. The case
selection protocol we used should have reduced biases; however, the results presented
herein with a small database may not represent the actual performance of the system in a
clinical setting. Findings in the current study demonstrated clearly that the issue of
reproducibility of image-based CAD systems needs to be investigated further.

It should be noted that we obtained somewhat different results in absolute terms for the
benign and malignant cases, but the pattern for the two groups remained similar. All cases
in our study were sufficiently suspicious to ultimately warrant a recommendation for
biopsy. We believe that at this stage, CAD schemes should be designed and optimized to
identify this group of cases, including those that ultimately prove to be benign. It is
well known that repeated scanning of the same image results in a slightly different digital
value matrix for a variety of technical reasons. In current CAD systems, a binary threshold
is typically used to generate detection marks. Each marked region has a computed score that
is above a predetermined threshold; hence, lesions with computed scores that are near the
threshold are vulnerable to small changes and may be detected in one image and missed in
another. Findings in the present study show that the reproducibility of false-positive cues
was much lower than that of true-positive cues (Tables 3 and 4), possibly because the
detection scores of false-positive regions are more often close to the threshold. We did
not perform a complete long-term follow-up to confirm that all false-positive cues actually
represented negative regions. Should any false-positive detection prove to be a true
abnormality, the computed reproducibility level would be lower than that reported herein.
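
The threshold argument can be illustrated with a small simulation. This is a hypothetical sketch and not a model of the commercial system: the score scale, the threshold, the Gaussian noise standing in for scan-to-scan digitization differences, and all parameter values are assumptions chosen only to show the effect.

```python
import random


def reproducibility(true_score, threshold=0.5, noise_sd=0.02,
                    n_regions=2000, n_scans=3, seed=0):
    """Fraction of regions marked in every scan among those marked at least once.

    Each scan perturbs the region score with Gaussian noise (an assumed
    stand-in for digitization variability) before the binary threshold.
    """
    rng = random.Random(seed)
    marked_any = 0
    marked_all = 0
    for _ in range(n_regions):
        marks = [true_score + rng.gauss(0.0, noise_sd) > threshold
                 for _ in range(n_scans)]
        if any(marks):
            marked_any += 1
            if all(marks):
                marked_all += 1
    return marked_all / marked_any if marked_any else float("nan")


well_above = reproducibility(true_score=0.70)  # score far above the threshold
borderline = reproducibility(true_score=0.50)  # score exactly at the threshold
print(well_above, borderline)
```

Under these assumptions, regions scored far above the threshold are marked in every scan, whereas regions scored at the threshold are marked in all three scans only in a small fraction of the cases in which they are marked at all.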

Note that the databases used in this and a previous (11) study were small; hence, the
results may not represent the actual reproducibility of CAD systems in the screening
environment. Despite this limitation, findings in the two studies highlight an important
point. Current CAD schemes are sensitive to small variations in the digital value matrices
that result from repeated scanning of the same images. This may have methodologic and
clinical practice implications that need to be addressed. The fact that all abnormalities
depicted in the present study were visible on both views indicates that the cases were not
particularly subtle and that the findings we report herein, including possible
implications, may be magnified in cases that are more difficult to identify visually or
when the abnormality is visible only on one view. We suspect that this sensitivity to minor
changes in the matrices is not unique to the CAD system evaluated in the current study.
Full-field digital mammography systems are rapidly becoming available (14,15). By
definition, once an image is acquired, the CAD detection result will be 100% reproducible
when the same CAD scheme is applied repeatedly to such an image. To be optimal, however,
current CAD schemes may have to be reengineered and reoptimized by using digitally acquired
images before these schemes can be applied optimally to full-field digital mammography
systems. An investigation on possible effects of repeated image acquisition of the same
breast on CAD results is beyond the scope of the present study.

Findings in our preliminary study suggest that sensitivity for the detection of
microcalcification clusters is high; as a result, reproducibility is also high. These
results are achieved at a low false-positive detection rate; hence, the CAD system is a
useful tool during the diagnostic process. Our results raise the important question about
the possible need to maintain records of CAD cues as available during the interpretation of
the individual cases. This may become an even more important issue as cancer detection
continues to progress toward an earlier stage (hence, a more subtle appearance) on the
average. Detailed documentation of all available information at the time of diagnosis is
not always done, particularly since information is often provided verbally. In the case of
screening mammographic interpretation, however, the presence of a malignancy that was
visible (in retrospect) on a previous mammogram, and for which a follow-up scan of the
original images in a CAD system may produce a true-positive identification, could present a
medicolegal problem. It will be difficult to argue that the abnormality in question was not
identified as suspicious on the original image. Findings in our preliminary study suggest
that this may be the case in a noticeable fraction of mass cases (approximately 20%, as
shown in Table 3).

The current practice associated with the use of CAD in the mammographic environment is not
clear on whether a record of the CAD results used during the case interpretation should be
retained. Until mass detection is substantially improved, results in our study suggest that
such a practice should be considered. Interestingly, although largely impractical, our
study findings clearly suggest that at this level of performance, multiple repeated scans
of each case could be acquired to improve the performance of CAD schemes.
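
The potential effect of repeated scans can be estimated from the counts in Tables 1 and 3. The sketch below is only illustrative: the three combination rules (any scan, majority of scans, all scans) are assumptions and were not evaluated in the study.

```python
total_masses = 96
images_total = 400

hits_per_scan = {1: 67, 2: 68, 3: 64}                       # Table 1
masses_by_times_detected = {3: 58, 2: 8, 1: 9}              # Table 3
false_positive_regions_per_scan = {1: 130, 2: 131, 3: 125}  # Table 1
false_positive_regions_any_scan = 184                       # Table 3

# Masses retained under three hypothetical combination rules.
any_scan = sum(masses_by_times_detected.values())                     # 75
majority = masses_by_times_detected[3] + masses_by_times_detected[2]  # 66
all_scans = masses_by_times_detected[3]                               # 58

best_single = max(hits_per_scan.values()) / total_masses
sens_any = any_scan / total_masses
sens_majority = majority / total_masses
sens_all = all_scans / total_masses

fp_rate_single_max = max(false_positive_regions_per_scan.values()) / images_total
fp_rate_any = false_positive_regions_any_scan / images_total

print(f"best single scan sensitivity: {best_single:.3f}")
print(f"any-scan sensitivity:         {sens_any:.3f}")
print(f"majority-rule sensitivity:    {sens_majority:.3f}")
print(f"all-scans sensitivity:        {sens_all:.3f}")
print(f"false-positive mass cues per image, single scan (max) vs any scan: "
      f"{fp_rate_single_max:.3f} vs {fp_rate_any:.3f}")
```

Accepting a mark from any of the three scans raises abnormality-based mass sensitivity from at most 68 of 96 to 75 of 96, but under the same rule the false-positive mass regions accumulate as well (184 regions over 400 images, compared with at most 131 in a single scan).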
Performance Change of Mammographic CAD Schemes Optimized with Most-Recent
and Prior Image Databases

Bin Zheng, PhD, Walter F. Good, PhD, Derek R. Armfield, MD, Cathy Cohen, MD,
Todd Hertzberg, MD, Jules H. Sumkin, DO, David Gur, ScD


Rationale  and  Objectives.  The  authors  evaluated  performance  changes  in  the  detection  of  masses  on  “current”  (latest) 
and  “prior”  images  by  computer-aided  diagnosis  (CAD)  schemes  that  had  been  optimized  with  databases  of  current  and 
prior  mammograms. 

Materials and Methods. The authors selected 260 pairs of matched consecutive mammograms. Each current image
depicted one or two verified masses. All prior images had been interpreted originally as negative or probably benign. A
CAD scheme initially detected 261 mass regions and 465 false-positive regions on the current images, and 252
corresponding mass regions (early signs) and 471 false-positive regions on prior images. These regions were divided into two
training and two testing databases. The current and prior training databases were used to optimize two CAD schemes with
a genetic algorithm. These schemes were evaluated with two independent testing databases.

Results.  The  scheme  optimized  with  current  images  produced  areas  under  the  receiver  operating  characteristic  curve  of 
0.89  ±  0.01  and  0.65  ±  0.02  when  tested  with  current  images  and  prior  images,  respectively.  The  scheme  optimized  with 
prior  images  produced  areas  under  the  receiver  operating  characteristic  curve  of  0.81  ±  0.02  and  0.71  ±  0.02  when  tested 
with  current  images  and  prior  images,  respectively.  Performance  changes  for  both  current  and  prior  testing  databases  were 
significant  (P  <  .01)  for  the  two  schemes. 

Conclusion.  CAD  schemes  trained  with  current  images  do  not  perform  optimally  in  detecting  masses  depicted  on  prior 
images.  To  optimize  CAD  schemes  for  early  detection,  it  may  be  important  to  include  in  the  training  database  a  large 
fraction  of  prior  images  originally  reported  as  negative  and  later  proven  to  be  positive. 

Key Words. Breast neoplasms, diagnosis; breast radiography; computers, diagnostic aid.

Mammography is considered the most reliable and cost-effective screening method for the
early detection of breast cancers, which could lead to early treatment and substantially
reduce associated mortality and morbidity (1,2). The large volume of mammograms obtained
and the low cancer detection rates in a mammographic screening environment could result in
radiologists missing as many as 10%-30% of cancers rated “visible” during retrospective
reviews (3,4). To assist radiologists in detecting more cancers at screening,
computer-aided detection (CAD) systems are being used in many medical institutions around
the world (5,6). A number of studies have been conducted to assess their possible effect on
radiologists’ performance. Although there is no general agreement on whether and how CAD
systems help radiologists improve their diagnostic accuracy (7,8), a number of studies have
demonstrated that the performance of the particular CAD scheme (including sensitivity,
false-positive rate, and reproducibility) could be important in this regard (9-11).

Current guidelines recommend periodic mammographic screening for women over age 40 years
(12). As compliance increases in the general population, a large fraction of patients will
have undergone a series of mammographic examinations. As more of the most easily detected
cancers are identified during the initial examination with the incorporation of CAD into
the diagnostic process, detected breast cancers will be shifted, on average, toward an
earlier stage. In other words, more subtle cancers will be considered visible or detectable
on routine mammograms. This will occur also in part because of the availability of previous
images for comparison, which could help radiologists detect more subtle cancers (13,14).
In this changing environment, it is not clear whether current CAD schemes optimized with a
large number of easily detected cancers are best suited for the detection of earlier or
more subtle cancers. This may become an important issue in developing and evaluating new
CAD schemes. In our experiment, an artificial neural network (ANN) previously used in our
own CAD scheme for mass detection was reoptimized separately by means of mass regions
depicted on “current” images (from the most recent examination, at which the mass was
actually reported, leading to biopsy) and those depicted on the corresponding “prior”
images (originally interpreted as negative). Hence, two different schemes were used. The
changes in their performance were then evaluated when they were applied to independent sets
of cases with masses depicted on both current and prior images.


MATERIALS  AND  METHODS 


We searched our database for verified cases in which both current and prior images had
been collected and digitized. Inclusion criteria required that at least one mass had been
identified by a radiologist on the current images and that biopsy had been performed as a
result. In addition, during a retrospective review and with the support of available source
documents, an experienced observer (B.Z.) had to be able to identify a mass at the
corresponding locations on the prior images. In each case, the most recent prior image had
been interpreted as negative or “not highly suspicious.”


As a result, 134 cases were selected for this study. The mass was visible on both views
(craniocaudal and mediolateral oblique) in 126 cases and on only one view in eight cases.
Hence, 260 pairs of images, with each pair consisting of one current image and one prior image,
were included in the study. On these images 270 distinct mass regions were identified (10
images depicted two mass regions), 220 of which were associated with biopsy-proved malignancy
(50 were benign). The locations of all masses depicted on current images and the corresponding
regions on prior images were visually identified as confirmed by the diagnostic reports and
pathology results.

The centerpoint (x,y coordinate) of each verified mass region was marked manually and saved in
a reference (or “truth”) file.

All 520 images (260 current and 260 prior) were processed by a CAD scheme developed previously
in our laboratory to identify and classify suspicious regions (15). The scheme includes three
stages. First, it uses image subtraction and threshold results after processing by two Gaussian
filters with a large difference in kernel sizes (7 pixels and 51 pixels) to search for the
initial set of suspicious regions, a process that usually results in the identification of
10-30 suspicious regions per image. In the second stage, on the basis of local contrast
measurement, the scheme uses an adaptive region growth algorithm to define three topographic
layers for each region. Through the imposition of threshold conditions of growth ratio and
shape factor for each layer in the regions identified as potential lesions, this stage
eliminates approximately 85% of identified regions from consideration, while maintaining high
sensitivity. A set of features is computed for each detected region. During the third stage,
the remaining regions are classified according to scores generated by a nonlinear multilayer
feature-based classifier, defining the likelihood of there being true-positive findings in
those regions (16).

In this experiment, all remaining regions identified as suspicious mass regions after the
second stage of the CAD scheme were selected for further consideration (the classification
scores in the third stage were ignored). As a result, 726 suspicious regions on the 260 current
images and 723 suspicious regions on the 260 prior images were selected. If the location of a
selected region matched that of a verified mass, the region identification was considered
true-positive. Specifically, the distance between the center of gravity of a region, as
detected automatically by the CAD scheme, and the center of the mass, as recorded in the
reference file, had to be shorter than the radius of the longest axis of the detected region.
Otherwise, the region was considered a false-positive identification.


Number of Suspicious Mass Regions in Each Data Set

            Training Data Set                  Testing Data Set
Images      True-Positive    False-Positive    True-Positive    False-Positive
Current     131 (103)        233               130 (108)        232
Prior       126 (100)        236               126 (104)        235

Note: Numbers in parentheses indicate the regions associated with malignant masses.
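
The location-matching rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name and argument layout are hypothetical, and "radius of the longest axis" is read here as half the length of the region's longest axis, which the text does not spell out.

```python
import math


def is_true_positive(region_centroid, longest_axis, mass_center):
    """Return True when a detected region matches a verified mass.

    region_centroid: (x, y) center of gravity of the CAD-detected region.
    longest_axis:    length of the region's longest axis, in pixels.
    mass_center:     (x, y) center of the mass from the reference file.
    """
    dx = region_centroid[0] - mass_center[0]
    dy = region_centroid[1] - mass_center[1]
    distance = math.hypot(dx, dy)
    # Assumption: "radius of the longest axis" means half its length.
    return distance < longest_axis / 2.0


# A region whose centroid lies about 12 pixels from the marked center matches
# when its longest axis is 40 pixels; one 35 pixels away does not.
print(is_true_positive((105, 210), 40, (112, 200)))
print(is_true_positive((105, 210), 40, (140, 210)))
```

Any region that fails this test would be counted as a false-positive identification, as in the text.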

The locations of 261 of the 726 selected regions on current images matched those of verified
masses, compared with 252 of the 723 regions on the prior images. All true-positive and
false-positive regions were then randomly divided into four mutually exclusive data sets, two
for current images and two for prior images. To minimize potential bias, true-positive regions
of the same mass (depicted on craniocaudal and mediolateral oblique views) were assigned to the
same data set (either training or testing), and when a mass region was assigned to the training
(or testing) subset in current images, its corresponding regions as depicted on prior images
were also assigned to the training (or testing) subset. The Table summarizes the number and
distribution of true-positive regions and false-positive regions in each of the four data sets.
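
The grouping constraint described above (all regions of one mass, on both views and on both current and prior images, stay in the same subset) might be implemented along the following lines. This is a sketch, not the authors' procedure: the record keys `mass_id` and `exam`, the 50% training fraction, and the fixed seed are all assumptions made for illustration, and only true-positive regions are handled.

```python
import random
from collections import defaultdict


def split_by_mass(regions, train_fraction=0.5, seed=0):
    """Split true-positive regions into four subsets, grouped by mass.

    regions: list of dicts with hypothetical keys 'mass_id' and
             'exam' ('current' or 'prior').
    Returns a dict keyed by (exam, 'train' | 'test').
    """
    by_mass = defaultdict(list)
    for region in regions:
        by_mass[region["mass_id"]].append(region)

    # Randomize at the level of masses, not individual regions.
    mass_ids = sorted(by_mass)
    random.Random(seed).shuffle(mass_ids)
    n_train = round(len(mass_ids) * train_fraction)
    train_ids = set(mass_ids[:n_train])

    subsets = {(exam, part): []
               for exam in ("current", "prior")
               for part in ("train", "test")}
    for mass_id, group in by_mass.items():
        part = "train" if mass_id in train_ids else "test"
        for region in group:
            subsets[(region["exam"], part)].append(region)
    return subsets
```

Splitting on the mass identifier rather than on individual regions is what prevents the same lesion from appearing in both training and testing data.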

Training data sets from the current and prior images were used to optimize two feature-based
ANNs independently as substitutes for the third stage in our CAD scheme (16). Previous studies
have demonstrated that the feature distributions were different for mass regions depicted on
current images and those depicted on prior images and that different feature sets should be
used for optimal classification results (17,18). Therefore, we applied a genetic algorithm to
search separately for optimal sets of features on current images and on prior images, using the
genetic algorithm software and optimization protocol that had been used in our previous studies
to optimize both Bayesian belief networks (19) and ANNs (20).

In brief, a binary coding method is applied to create a chromosome used in the genetic
algorithm. Each extracted feature corresponds to a gene (that is, a value of either 0 or 1).
To determine the optimal number of neurons in the second (hidden) layer of the ANN, we include
four additional genes in the chromosome. Hence, the chromosome has a fixed length of 40 genes,
of which the first 36 represent extracted image features and the last four indicate the
binary-coded number of hidden neurons (eg, 0101 is the code for five hidden neurons) (20). As
initial parameters in the genetic algorithm software, we used a population size of 100 and set
the crossover rate, the mutation rate, and the generation gap to 0.6, 0.001, and 1.0,
respectively. To minimize overfitting and increase robustness of the ANN performance, we
adopted a limited number of training iterations (1,000), as well as a large ratio between the
momentum (0.8) and learning rate (0.01) in the ANN. The output of the ROCFIT software program
(University of Chicago, Ill) (21) was interfaced with the fitness function of the genetic
algorithm, and Az values computed by the program were defined as fitness criteria in the
genetic algorithm. The genetic algorithm was terminated when it either converged to the
“highest” Az value (with no further improvement accomplished in the new generation) or reached
a predetermined number of generations (eg, 100).
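
As a small illustration of the chromosome layout described above, the following Python sketch decodes a 40-gene chromosome into a list of selected feature indices and a hidden-neuron count. The function name is hypothetical, and treating a gene value of 1 as "feature selected" is an assumption; the text states only that each gene is 0 or 1.

```python
def decode_chromosome(genes, n_features=36, n_size_bits=4):
    """Decode a binary chromosome into (selected feature indices, hidden neurons).

    genes: sequence of 0/1 integers, n_features + n_size_bits long.
    """
    if len(genes) != n_features + n_size_bits:
        raise ValueError("chromosome must contain exactly 40 genes")
    # Assumption: a gene value of 1 means the feature is included.
    selected = [i for i, gene in enumerate(genes[:n_features]) if gene == 1]
    # The last four genes are read as a binary number, most significant bit first.
    hidden_neurons = int("".join(str(gene) for gene in genes[n_features:]), 2)
    return selected, hidden_neurons


chromosome = [0] * 36 + [0, 1, 0, 1]
chromosome[0] = 1
chromosome[7] = 1
print(decode_chromosome(chromosome))  # ([0, 7], 5)
```

Four bits allow codes from 0 to 15; how a code of 0000 (no hidden neurons) was handled is not stated in the text, so the sketch leaves it undecided.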

Using this approach, we generated two optimal ANNs, each using a different training data set.
ANN-1 was trained with the suspicious mass regions extracted solely from the current images,
and ANN-2 was trained with regions extracted solely from the prior images. Then we applied each
of the ANNs to the two mutually exclusive testing data sets of regions extracted from both
current and prior images. The classification scores in each test were used to generate four
receiver operating characteristic (ROC) curves. The four Az values were compared. We defined
the threshold as a false-positive detection rate similar to that of the leading commercial CAD
products, approximately 0.4 false-positive mass regions per image (7). At this level, we found
the corresponding detection sensitivity levels and computed the expected number of detected
true-positive regions (of the 130 true-positive regions in the testing data set of current
images and the 126 in that of prior images). Thus, we compared the change in expected
true-positive detection levels with the use of ANN-1 and ANN-2 for current and prior images at
an operating point currently accepted in clinical CAD.


RESULTS 


From the genetic algorithm and training data sets of current images and of prior images, two
optimal ANNs were generated. ANN-1 included 13 features, and ANN-2 included 11 (Fig 1); four
features were common to both. Many of the features are not orthogonal, which is not unique to
our scheme. The highest Az values achieved for the training data sets were 0.92 ± 0.01 for
ANN-1 and 0.76 ± 0.02 for ANN-2. When ANN-1 was applied to the testing data sets, the Az values
were 0.89 ± 0.01 and 0.65 ± 0.02 for current and prior images, respectively. Figure 2 shows the
three ROC curves for the training result and the two testing results. When ANN-2 was applied to
the same data sets, the Az values were 0.81 ± 0.02 for current and 0.71 ± 0.02 for prior
images. Figure 3 shows the corresponding ROC curves for ANN-2.

Figure 1. Features selected by means of the genetic algorithm for ANN-1 and ANN-2.
Those in boldface are common to both ANNs.

[...] background

ANN-1
 6. Contrast (3rd layer)
 7. Standard deviation of radial length (3rd layer)
 8. Circularity (3rd layer)
 9. Ratio between the maximum and minimum radial lengths (3rd layer)
10. Difference of minimum pixel values inside and outside of the growth region (3rd layer)
11. Region conspicuity (3rd layer)
12. Standard deviation of pixel values (3rd layer)
13. Standard deviation of pixel values in the segmented breast area

ANN-2
 6. Region perimeter divided by size (3rd layer)
 7. Standard deviation of radial length (3rd layer)
 8. Circularity (3rd layer)
 9. Skewness of pixel values of background
10. Average local pixel value fluctuation (within a 5 x 5 frame) of the segmented breast area
11. Region conspicuity (3rd layer)

The test results differed significantly (P < .01) between ANN-1 and ANN-2 for both the current
and prior image testing data sets. As shown in Figure 4, Az values were reduced by 9.0% (from
0.89 with ANN-1 to 0.81 with ANN-2) for the current testing data set and increased by 9.2%
(from 0.65 to 0.71) for the prior testing data set. In addition, at an operating point of 0.4
false-positive detections per image, the sensitivity levels represented by the two ROC curves
in Figure 2 are 0.82 and 0.40. In Figure 3, the corresponding sensitivity levels are 0.68 and
0.52. If we convert these levels to an expected number of detected true-positive mass regions,
ANN-1 would detect 18 additional mass regions in the current testing data set, while ANN-2
would detect 15 additional mass regions in the prior testing data set.
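
The percentages and region counts quoted above follow directly from the reported Az values, sensitivities, and testing-set sizes (130 true-positive regions on current images, 126 on prior images). The short Python check below reproduces the arithmetic; the helper names are ours and are used only for illustration.

```python
def relative_change(old, new):
    """Percentage change from old to new."""
    return 100.0 * (new - old) / old


def additional_detections(sensitivity_a, sensitivity_b, n_regions):
    """Expected extra true-positive regions found by scheme A over scheme B."""
    return (sensitivity_a - sensitivity_b) * n_regions


# Az on the current testing set: ANN-1 (0.89) versus ANN-2 (0.81).
print(round(relative_change(0.89, 0.81), 1))          # -9.0
# Az on the prior testing set: ANN-1 (0.65) versus ANN-2 (0.71).
print(round(relative_change(0.65, 0.71), 1))          # 9.2
# Sensitivity at 0.4 false-positive detections per image.
print(round(additional_detections(0.82, 0.68, 130)))  # 18
print(round(additional_detections(0.52, 0.40, 126)))  # 15
```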

The results are not substantially different when benign masses are excluded from the analysis.
ANN-1 yielded performance levels (Az) of 0.88 ± 0.02 and 0.63 ± 0.02 for current and prior
images, respectively; the comparable values for ANN-2 were 0.81 ± 0.02 and 0.70 ± 0.03.

Figure 2. ROC curves showing the performance of ANN-1 during training with the current image
data set (O) and during testing with the current image data set (A) and the prior image data
set (■). Horizontal axis: False-Positive Rate (1 - Specificity).

Figure 3. ROC curves showing the performance of ANN-2 during training with the prior image data
set (O) and during testing with the current image data set (A) and the prior image data
set (■).


DISCUSSION 


Feature-based machine learning classifiers, such as ANNs, are widely used in CAD schemes as a
final stage in identifying and classifying abnormalities. Since these classifiers are trained
to generate a “global” function to cover the entire instance space (22), their performance
depends heavily on the training databases. This is particularly true in mammography, for which
the size and diversity of training data sets are often limited (23,24). Optimal feature sets
such as those selected by the genetic algorithm could differ for different limited-size
training databases. Hence, the features selected in this study for the current images were very
similar but not identical to those selected in our previous studies (16,18). A single CAD
scheme that achieves high sensitivity for both subtle and relatively easy-to-detect masses at
an acceptable false-positive rate can be developed if a large and diverse image database is
available. However, the creation of such a database is very difficult, because image features
(including texture- and morphology-based features) are substantially different for suspicious
mass regions extracted from current and prior images, as previous studies have demonstrated
(17,18).

The CAD scheme trained with the current image data set did not perform optimally when tested
with the prior image data set, and vice versa. On the one hand, it is important for a CAD
scheme to detect more subtle masses, because most radiologists can identify the easily detected
ones. On the other hand, users may lose confidence in a scheme if it frequently misses masses
that should be easy to detect. Without such confidence, radiologists will most likely be
reluctant to accept CAD cuing on subtle masses or make any changes in their initial
interpretation (8), preventing the full benefit of CAD schemes from being realized in clinical
environments. When ANN-2, which had been trained with the prior image data set, was tested with
the current image data set, the testing results were better (higher Az) than the training
results, demonstrating the general robustness of the scheme (Fig 3).

Figure 4. Differences in area under the ROC curve (Az) for ANN-1 and ANN-2 when tested with the
current image data set (A) and the prior image data set (■).

Like most commercially available CAD systems, our CAD scheme was designed to detect, not
classify, suspicious abnormalities. Therefore, we believe that the scheme should be highly
sensitive to all suspicious mass regions considered “actionable” by radiologists (eg,
recommended for follow-up or biopsy), even if some regions later prove benign. One of our
previous studies suggested that radiologists’ performance in classifying abnormalities as
benign or malignant was not affected by the performance of CAD cuing for detection purposes
(11). In any event, the inclusion of the benign mass regions as true-positive cases in this
experiment did not affect our results and conclusions.

With improvements in diagnostic technology and increasing compliance with screening
recommendations among women generally, radiologists have to detect increasingly subtle
abnormalities depicted on mammograms. As a result, the performance of a CAD system that
initially provided satisfactory cuing results when optimized could deteriorate substantially
over time. Therefore, it may be beneficial to update training data sets periodically and
reoptimize the schemes by using a large fraction of new cases originally rated negative and
later found positive. An alternative approach could be to provide two types of cues, one
trained with current and one with prior images (“early signs”). We believe that our
experimental results are not unique to our own image database, our CAD scheme, or ANN-based CAD
schemes but should apply to all types of CAD schemes in which feature-based machine learning
classifiers are used.
ABSTRACT

This study was designed to evaluate radiologists’ ability to identify highly-compressed, digitized mammographic
images displayed on high-resolution monitors. Mammography films were digitized at 50 micron pixel dimensions
using  a  high-resolution  laser  film  digitizer.  Image  data  were  compressed  using  the  irreversible  (lossy),  wavelet-based 
JPEG  2000  method.  Twenty  images  were  randomly  presented  in  pairs  (one  image  per  monitor)  in  three  modes:  mode  1, 
non-compressed  versus  50:1  compression;  mode  2,  non-compressed  versus  75:1  compression;  and  mode  3,  50:1  versus 
75:1  compression  with  20  random  pairs  presented  twice  (80  pairs  total).  Six  radiologists  were  forced  to  choose  which 
image  had  the  lower  level  of  data  compression  in  a  two-alternative  forced  choice  paradigm.  The  average  percent  correct 
across  the  six  radiologists  for  modes  1,  2  and  3  were  52.5%  (+/-11.3),  58.3%  (+/-14.7),  and  58.3%  (+/-7.5), 
respectively.  Intra-reader  agreement  ranged  from  10  to  50%  and  Kappa  from  -0.78  to  -0.19.  Kappa  for  inter-reader 
agreement  ranged  from  -0.47  to  0.37.  The  “monitor  effect”  (left/right)  was  of  the  same  order  of  magnitude  as  the 
radiologists’  ability  to  identify  the  lower  level  of  image  compression.  In  this  controlled  evaluation,  radiologists  did  not 
accurately  discriminate  non-compressed  and  highly-compressed  images.  Therefore,  75:1  image  compression  should  be 
acceptable  for  review  of  digitized  mammograms  in  a  telemammography  system. 

Keywords:  Image  compression,  data  compression,  JPEG  2000,  telemammography 


1.  INTRODUCTION 

Breast cancer screening mammography is widely practiced and increasingly challenging to manage in the clinical
environment, but there is potential for improvement [1-7]. Teleradiology is an approach that may provide more timely
patient management. Image compression [8-13], image cropping [12-14], and image selection [15] are commonly used in
teleradiology to facilitate the timely transmission of data. The high spatial resolution required for mammography
complicates the design and implementation of a telemammography system. The large mammographic image file size
(33-55 MBytes per image) is one obstacle to timely transmission of data, especially across low-level data connections.
High-level image compression may assist in overcoming this obstacle, but it can only be realized with lossy image
compression techniques, which necessitate the loss of some image information and a degree of image degradation.

The use of high-level image compression in medical applications is frequently met with skepticism because of the
potential degradation of the depiction of objects under investigation. Human observer performance studies designed to
evaluate wavelet compression of medical images for clinical applications have reported acceptable compression levels
ranging from 8:1 to 100:1 [16-26]. Wavelet-based compression, the trend in medical image compression, is reported to be
superior to the original JPEG compression based on the discrete cosine transform in terms of image quality at high levels
of image compression [16,17].


*  jklst3@pitt.edu;  phone  (412)  641-2572;  fax  (412)  641-2582,  University  of  Pittsburgh,  Magee-Womens  Hospital,  300 
Halket  Street,  Suite  4200,  Pittsburgh,  PA  15213 


Medical  Imaging  2004:  Image  Perception,  Observer  Performance,  and  Technology  Assessment, 
edited  by  Dev  P.  Chakraborty,  Miguel  P.  Eckstein,  Proceedings  of  SPIE  Vol.  5372 
(SPIE, Bellingham, WA, 2004) · 1605-7422/04/$15 · doi: 10.1117/12.533201


From our perspective, the effect of image degradation from lossy compression on medical image interpretation remains
unresolved, particularly regarding mammography. Observer studies reported that 8:1 [22] and 10:1 [27] compression ratios
are acceptable for mammography applications using both wavelet and the original JPEG compression methods.
Visualization of calcifications depicted on digitized mammograms was subjectively rated as excellent for wavelet
compression ratios as high as 56:1 [19]. Uncompressed digitized mammographic images were rated to be comparable to
images compressed at 30:1 using wavelet compression [20]. These studies are indeed promising, and high levels of image
compression may ultimately be clinically acceptable in mammography.

Powell et al. [22] (2000) conducted a clinical evaluation that compared film mammography with digitized images compressed
at 8:1 using wavelet-based compression. The accuracy for detecting malignancy was not statistically different when
depicted on film or digitized images in a receiver operating characteristic (ROC) study. The false-positive rate at a
fixed sensitivity of 0.90 was significantly lower (better) using digitized images as compared with film. Compressed
digitized images were also slightly better (though not statistically significantly) than film in terms of recall rate for
negative mammograms and those depicting benign findings. The recall rate for mammograms depicting malignant abnormalities
was slightly better (though not statistically significantly) when original films were used as compared with digitized
images.

The  objective  of  this  study  was  to  determine  an  acceptable  level  of  image  compression  in  a  telemammography 
application.  The  ability  of  radiologists  to  discriminate  high-levels  of  image  compression  as  applied  to  digitized 
mammograms  was  evaluated.  Image  pairs  of  different  compression  levels  were  randomly  presented  and  viewed  side- 
by-side  on  two  high-resolution  monitors.  Six  radiologists  were  forced  to  choose  the  lower  level  of  image  compression 
and  rate  the  relative  utility  of  the  images  for  use  in  a  screening  mammography  environment. 


2.  METHODS 


2.1  Case  selection 

This study used twenty breast cancer screening examinations randomly selected from a larger telemammography
project, which was designed to evaluate the ability of telemammography to reduce the number of patients being recalled
for additional imaging procedures. One image view from each case (i.e., twenty images total) was selected to represent
each examination. The verified findings depicted in these examinations included masses and calcification clusters
(Table 1). The dataset for this retrospective study was assembled and analyzed under a University of Pittsburgh
Institutional Review Board approved protocol, and the image data were anonymized.


Table 1

Image views and depicted abnormalities

                  Abnormality depicted on image
View    Mass    Calcifications    Mass & calcifications    No finding
MLO     3       2                 2                        3
CC      3       3                 1                        3

MLO - mediolateral oblique
CC - craniocaudal


2.2  Image  processing 

Mammographic  films  were  digitized  at  50  micron  pixel  dimensions  and  12-bit  grayscale  using  a  high-resolution,  laser 
film  digitizer  (Lumiscan  85,  Eastman  Kodak,  Rochester,  NY,  USA).  Each  digitized  mammographic  image  was 
automatically  cropped  to  decrease  the  non-tissue  area  surrounding  the  breast.  The  cropped  image  data  were  compressed 
using  the  irreversible  (lossy),  9/7  transform,  wavelet-based  JPEG  2000  method  at  compression  ratios  of  50:1  and  75:1 
and subsequently decompressed prior to display. A total of sixty images were generated for the study: the twenty
original digitized images plus, for each of these, two compressed images at 50:1 and 75:1 ratios.
2.3 Image display

The images were displayed on two calibrated, high-resolution (2048 x 2560), 8-bit grayscale, portrait monitors at a
nominal setting of 80 ftL (DS5100P, Clinton Electronics, Rockford, IL, USA). Typically, when a single image was
displayed on the monitor, the display scale was approximately 100 microns per pixel. Minimal unsharp masking was
employed. In short, image data were first smoothed with a 2-D 129 mean kernel, and subsequently the weighted (0.10)
smoothed image was subtracted from the original image. Finally, the resulting pixel values were re-scaled from 0 to
4095. Image magnification and window/level adjustments were not permitted during the study.
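
The unsharp-masking steps described above (mean smoothing, weighted subtraction, rescaling to 0-4095) can be sketched in plain Python as follows. Several details are assumptions rather than facts from the text: the "129" kernel is read as a 129 x 129 window, border pixels are averaged over the in-bounds part of the window only, and the rescaling is taken to be a linear min-max stretch. The loop-based filter is meant for clarity, not speed.

```python
def box_mean(image, size):
    """Mean filter with a size x size window (in-bounds pixels only at edges)."""
    height, width = len(image), len(image[0])
    half = size // 2
    out = [[0.0] * width for _ in range(height)]
    for r in range(height):
        r0, r1 = max(0, r - half), min(height, r + half + 1)
        for c in range(width):
            c0, c1 = max(0, c - half), min(width, c + half + 1)
            total = sum(image[rr][cc]
                        for rr in range(r0, r1) for cc in range(c0, c1))
            out[r][c] = total / ((r1 - r0) * (c1 - c0))
    return out


def unsharp_mask(image, kernel_size=129, weight=0.10, out_max=4095):
    """Subtract a weighted smoothed copy, then rescale linearly to 0..out_max."""
    smoothed = box_mean(image, kernel_size)
    diff = [[p - weight * s for p, s in zip(row, srow)]
            for row, srow in zip(image, smoothed)]
    lo = min(min(row) for row in diff)
    hi = max(max(row) for row in diff)
    if hi == lo:  # flat image: nothing to stretch
        return [[0 for _ in row] for row in diff]
    scale = out_max / (hi - lo)
    return [[round((v - lo) * scale) for v in row] for row in diff]
```

For full-size mammograms a separable or integral-image mean filter would be the practical choice; the nested loops here only make the three steps explicit.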

Fixed  look-up  table  (LUT)  values  are  automatically  calculated  based  on  the  pixel  value  distribution  (histogram).  In 
short,  the  typical  pixel  value  distribution  of  digitized  mammographic  images  is  bimodal.  The  center  between  the  two 
modes  was  set  as  the  level  value  (brightness),  and  the  span  of  the  two  modes  was  set  as  the  window  value  (contrast). 
Additionally,  the  cropped  images  were  padded  (filled)  prior  to  display  to  restore  the  full  height  of  the  image. 

2.4  Study  protocol 

Six  experienced  radiologists  participated  in  the  study.  They  were  presented  image  pairs  (one  image  per  monitor)  that 
consisted of the same image at different levels of compression (Fig. 1). The images were paired in three modes: mode
1,  non-compressed  versus  50:1  compression;  mode  2,  non-compressed  versus  75:1  compression;  and  mode  3,  50:1 
versus  75:1  compression.  The  sixty  image  pairs  were  randomly  presented  with  20  randomly  selected  pairs  presented  a 
second  time  to  evaluate  intra-observer  variability  (or  a  total  of  eighty  pairs).  Compression  levels  were  also  randomly 
assigned  between  the  two  monitors  for  counterbalancing. 
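
A presentation schedule of this kind (sixty pairs, twenty of them repeated, with the less compressed image randomly placed on the left or right monitor) could be generated as in the Python sketch below. The function and field names are hypothetical, and the per-trial coin flip is an assumption: the text says only that monitor assignment was random, not how balance between monitors was enforced or whether a repeated pair kept its side.

```python
import random

# The three pairing modes: (less compressed image, more compressed image).
MODES = {
    1: ("uncompressed", "50:1"),
    2: ("uncompressed", "75:1"),
    3: ("50:1", "75:1"),
}


def build_schedule(n_images=20, n_repeats=20, seed=0):
    """Return a shuffled list of 2-AFC trials with random monitor assignment."""
    rng = random.Random(seed)
    pairs = [(image, mode) for image in range(n_images) for mode in MODES]
    # Repeat a random subset of distinct pairs to measure intra-reader agreement.
    trials = pairs + rng.sample(pairs, n_repeats)
    rng.shuffle(trials)

    schedule = []
    for image, mode in trials:
        less, more = MODES[mode]
        less_on_left = rng.random() < 0.5
        schedule.append({
            "image": image,
            "mode": mode,
            "left": less if less_on_left else more,
            "right": more if less_on_left else less,
            "answer": "left" if less_on_left else "right",
        })
    return schedule


print(len(build_schedule()))  # 80
```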


Fig.  1.  Telemammography  workstation  used  for  the  study. 


In  a  2-AFC  paradigm  the  radiologists  were  forced  to  choose  the  image  (i.e.,  right  or  left  monitor)  that  had  the  lower 
level  of  data  compression.  In  addition,  they  compared  and  rated  the  clinical  utility  between  the  two  images  presented  in 
each  pair.  After  image  review,  two  questions  were  presented  on  a  computer  scoring  form  and  answered  using  the 
computer  mouse  (Fig.  2).  The  radiologists  were  given  written  instructions  regarding  the  protocol: 
You will be presented with 80 pairs of images, one image on each monitor. The window and
level  values  for  the  monitor  display  will  be  fixed.  Magnification  features  will  not  be  available 
during  this  study.  One  image  will  contain  less  information  than  the  other  as  a  result  of  data 
compression.  The  monitor  that  displays  the  less  compressed  image  will  be  randomly  selected. 
The  same  image  pairs  will  appear  multiple  times  throughout  the  study.  After  you  have 
reviewed  the  images,  the  “eval  case”  button  on  the  bottom  task  bar  will  bring  up  two  questions 
to  be  answered. 


Which monitor contains the image with more information?

( ) Left
( ) Right

If these images were part of a screening mammogram exam, for the purpose of
determining the need for additional procedures:

( ) The left image is superior to the right image.
( ) The left image is equivalent to the right image.
( ) The left image is inferior to the right image.

[Done]


Fig.  2  Computerized  scoring  form  completed  for  each  image  pair. 


2.5  Data  analysis 

The average percent of correct decisions across the six readers for discriminating the lower level of image compression
was compared with random (chance) selection using a one-sample t-test for each mode and each monitor. The Friedman
two-way analysis of variance by ranks was used to test whether there was a difference between modes. Kappa was used
to evaluate intra-reader agreement for the twenty repeated pairs of images and inter-reader agreement for each mode.
To determine whether a learning effect was present, the percent of correct decisions for the first, second, and third
presentations of pairs of images was tested for trend using the Page test for ordered alternatives. All images were
presented a minimum of three times, with the twenty repeated pairs randomly selected. The percentages of image pairs
rated as clinically equivalent for correct and for incorrect decisions in identifying the lower level of image
compression were compared with random (chance) selection using a one-sample t-test for each mode and each monitor.


3.  RESULTS 

The  subjective  appearance  of  the  compressed  images  was  extremely  similar  to  the  original  uncompressed  image.  The 
task  of  discriminating  the  more  compressed  image  in  each  pair  was  reported  to  be  difficult  by  all  readers.  The 
smoothing  effect  of  wavelet  compression  did  not  produce  distinguishable  image  features  such  as  blocking  artifacts 
characteristic  of  high-level  original  JPEG  compression. 

Readers’  ability  to  correctly  discriminate  the  lower  level  of  image  compression  was  only  slightly  better  than  chance  and 
was  of  the  same  order  of  magnitude  as  the  “monitor  effect”  (Table  2).  Readers’  performance  levels  were  not 
significantly  different  across  the  three  presentation  modes  (p  >  0.05).  However,  the  readers  correctly  identified  images 
compressed  at  50:1  ratio  as  lower  than  75:1  image  compression  at  a  rate  significantly  greater  than  chance  (p  <  0.05). 
On  average  the  readers  performed  better  when  the  lower  level  of  compression  was  presented  on  the  left  monitor  for  all 
three  modes,  but  the  “monitor  effect”  (left  versus  right)  was  not  significant. 

Table 2

Average percent correct for discriminating the lower compression level for all image
pairs when the correct image was on the right monitor and the left monitor

                           mode 1^a       mode 2^b       mode 3^c
All images                 52.5 (11.3)    58.3 (14.7)    58.3 (7.5)^e
Images on right monitor    45.7 (25.3)    43.2 (25.8)    47.8 (26.5)
Images on left monitor     62.5 (14.1)    73.2 (25.3)    69.0 (23.1)

a mode 1 - non-compressed & 50:1 compression
b mode 2 - non-compressed & 75:1 compression
c mode 3 - 50:1 & 75:1 compression
d group mean and standard deviation in ( )
e p < 0.05 one sample T-test
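
The comparison with chance in the "All images" row of Table 2 can be sketched from the summary statistics alone. The sketch below is illustrative rather than the authors' analysis code: it assumes the six readers are the sample (5 degrees of freedom), that chance performance is 50%, and that the test is two-sided, none of which is stated explicitly in the text.

```python
import math


def one_sample_t(mean, sd, n, mu0=50.0):
    """t statistic for a group mean tested against mu0, from summary statistics."""
    return (mean - mu0) / (sd / math.sqrt(n))


# Group mean and standard deviation, "All images" row of Table 2.
all_images = {
    "mode 1": (52.5, 11.3),
    "mode 2": (58.3, 14.7),
    "mode 3": (58.3, 7.5),
}

N_READERS = 6        # assumption: the six readers form the sample
T_CRIT_5DF = 2.571   # two-sided 5% critical value of Student's t with 5 df

t_values = {mode: one_sample_t(m, s, N_READERS) for mode, (m, s) in all_images.items()}
significant = {mode: abs(t) > T_CRIT_5DF for mode, t in t_values.items()}

for mode, t in t_values.items():
    print(f"{mode}: t = {t:.2f}, exceeds critical value: {significant[mode]}")
```

Under these assumptions only mode 3 exceeds the critical value, which agrees with the single footnote-e entry in the table. Modes 2 and 3 share the same mean but differ in spread, so only the smaller standard deviation yields a significant result.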


Agreement in discriminating the lower level of data compression was poor both within individual readers and
between readers (Tables 3 and 4). Kappa values for intra-reader agreement for readers 1, 2, 3, 4, 5, and 6 were -0.25,
-0.39, -0.30, -0.19, -0.78, and -0.30, respectively. No two readers consistently agreed across the three presentation
modes. Inter-reader Kappa for discriminating the lower level of image compression for the six readers ranged from
-0.47 to 0.26, -0.36 to 0.37, and -0.30 to 0.30 for modes 1, 2, and 3, respectively (Table 4).


Table 3

Comparison between the first and second reads of the twenty repeated image pairs

                            second read
reader    first read    correct^a    incorrect
1         correct       10 (2)       30 (6)
          incorrect     30 (6)       30 (6)
2         correct       15 (3)       30 (6)
          incorrect     40 (8)       15 (3)
3         correct       20 (4)       30 (6)
          incorrect     35 (7)       15 (3)
4         correct        5 (1)       25 (5)
          incorrect     25 (5)       45 (9)
5         correct        5 (1)       40 (8)
          incorrect     50 (10)       5 (1)
6         correct       10 (2)       50 (10)
          incorrect     20 (4)       20 (4)

a percentage and number in ( )
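
The intra-reader Kappa values quoted in the text can be recomputed from the counts in Table 3. The sketch below assumes the statistic is Cohen's kappa (the text says only "Kappa") and takes the six reader blocks of Table 3 in the order listed, with rows as the first read and columns as the second read.

```python
def cohen_kappa(both_correct, first_only, second_only, both_incorrect):
    """Cohen's kappa for a 2x2 table of first-read versus second-read decisions."""
    n = both_correct + first_only + second_only + both_incorrect
    observed = (both_correct + both_incorrect) / n
    first_correct = (both_correct + first_only) / n
    second_correct = (both_correct + second_only) / n
    # Agreement expected by chance from the marginal proportions.
    expected = (first_correct * second_correct
                + (1 - first_correct) * (1 - second_correct))
    return (observed - expected) / (1 - expected)


# Counts per reader block, in table order:
# (correct/correct, correct/incorrect, incorrect/correct, incorrect/incorrect)
table3_counts = [
    (2, 6, 6, 6),
    (3, 6, 8, 3),
    (4, 6, 7, 3),
    (1, 5, 5, 9),
    (1, 8, 10, 1),
    (2, 10, 4, 4),
]

kappas = [cohen_kappa(*counts) for counts in table3_counts]
for reader, k in enumerate(kappas, start=1):
    print(f"reader {reader}: kappa = {k:.2f}")
```

With these counts the six values round to -0.25, -0.39, -0.30, -0.19, -0.78, and -0.30, matching the values reported in the text in the same order.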


Table 4

Kappa for inter-reader agreement for the six readers and the three presentation modes

                                       reader
mode    reader      2         3         4         5         6
1^a     1        -0.471    -0.042     0.043    -0.200    -0.038
        2                   0.118     0.223    -0.100    -0.237
        3                             0.255    -0.200     0.151
        4                                      -0.100    -0.101
        5                                                -0.300
2^b     1         0.175    -0.359    -0.354    -0.300     0.368
        2                  -0.284    -0.023     0.100    -0.177
        3                             0.018     0.100    -0.217
        4                                      -0.200     0.125
        5                                                -0.300
3^c     1        -0.099    -0.300     0.121     0.100    -0.237
        2                  -0.100    -0.099     0.100     0.175
        3                             0.100    -0.200     0.100
        4                                       0.300    -0.031
        5                                                -0.100

a mode 1 - non-compressed & 50:1 compression
b mode 2 - non-compressed & 75:1 compression
c mode 3 - 50:1 & 75:1 compression


A  slight  learning  effect  was  observed  in  the  average  reader’s  ability  to  select  the  lower  level  of  image  compression 
during  the  first  three  presentations  (Table  5).  The  mean  percent  for  correctly  discriminating  the  lower  level  of  image 
compression  showed  an  increasing  trend  across  the  three  presentations  that  was  not  significant  (p  >  0.05).  Reader  6  was 
an  outlier,  and,  although  the  trend  was  not  significant,  excluding  this  reader  from  the  analysis  removed  the  increasing 
trend  across  the  three  presentations. 

Table 5

Percent correct for selecting the less compressed image during the first,
second, and third presentations

reader    first (n = 20)    second (n = 20)    third (n = 20)
1         65.0              50.0               60.0
2         55.0              55.0               60.0
3         50.0              50.0               55.0
4         60.0              65.0               70.0
5         50.0              50.0               50.0
6         35.0              80.0               65.0
mean      52.5              58.3               60.0^a
std       10.4              12.1                7.1

a p > 0.05
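
The summary rows of Table 5, and the remark that excluding reader 6 removes the increasing trend, can be checked directly from the per-reader values. This is a small illustrative sketch, not the Page test itself; it assumes the reported "std" is the sample standard deviation (n - 1 denominator).

```python
from statistics import mean, stdev

# Percent correct per reader for the first, second, and third presentations (Table 5).
percent_correct = {
    1: (65.0, 50.0, 60.0),
    2: (55.0, 55.0, 60.0),
    3: (50.0, 50.0, 55.0),
    4: (60.0, 65.0, 70.0),
    5: (50.0, 50.0, 50.0),
    6: (35.0, 80.0, 65.0),
}


def column_stats(readers):
    """Mean and sample standard deviation for each presentation over the given readers."""
    columns = list(zip(*(percent_correct[r] for r in readers)))
    return [mean(c) for c in columns], [stdev(c) for c in columns]


def strictly_increasing(values):
    return all(a < b for a, b in zip(values, values[1:]))


all_means, all_sds = column_stats([1, 2, 3, 4, 5, 6])
means_without_6, _ = column_stats([1, 2, 3, 4, 5])

print([round(m, 1) for m in all_means], [round(s, 1) for s in all_sds])
print([round(m, 1) for m in means_without_6])
```

With all six readers the means rise across the three presentations; with reader 6 left out the means are 56.0, 54.0, and 59.0, which is no longer a monotonic increase.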


Images  correctly  identified  as  less  compressed  by  the  readers  were  rated  as  “clinically  equivalent”  at  relatively  the  same 
rate  as  images  incorrectly  identified  (Table  6).  However,  on  the  left  monitor  the  readers  rated  correctly  selected  images 
as “clinically equivalent” more often than random selection (p < 0.05). The average number of image pairs rated as
clinically equivalent by the six radiologists was 14.2 (± 4.8), 14.2 (± 4.1), and 13.3 (± 5.5) out of the twenty possible
pairs for modes 1, 2, and 3, respectively.


Table 6

Percent of image pairs rated “clinically equivalent” for correct and incorrect selection of lower compression
level for either monitor, the right monitor, and the left monitor

         correct choice of lower compression level            incorrect choice of lower compression level
mode     either monitor^d   right monitor   left monitor      either monitor   right monitor   left monitor
1^a      48.3 (20.1)        24.0 (13.6)     24.3 (15.2)       51.7 (20.1)      34.2 (24.9)     17.5 (10.5)
2^b      62.3 (19.2)        18.9 (15.6)     43.4 (17.8)^e     37.7 (19.2)      26.3 (15.9)     11.4 (12.9)^e
3^c      53.1 (18.6)        19.2 (15.4)     33.9 (17.0)       46.9 (18.6)      27.1 (20.7)     19.8 (14.7)

a mode 1 - non-compressed & 50:1 compression
b mode 2 - non-compressed & 75:1 compression
c mode 3 - 50:1 & 75:1 compression
d group mean and standard deviation in ( )
e p < 0.05 one sample T-test


4.  DISCUSSION 

In  this  controlled  evaluation,  image  compression  achieved  with  wavelet-based  JPEG  2000  was  not  reliably 
discriminated  and  rated  by  radiologists  and,  therefore,  could  be  considered  applicable  for  telemammography 
applications.  Radiologists  did  not  accurately  or  reliably  select  the  lower  level  of  image  compression  between  image 
pairs  when  presented  side-by-side  with  non-compressed  images  and  those  compressed  at  50:1  and  75:1  compression 
levels.  Interestingly,  the  “monitor  effect”  (left  versus  right)  was  of  the  same  order  of  magnitude  as  the  radiologists’ 
ability  to  discriminate  the  lower  level  of  image  compression.  As  a  group  the  readers’  ability  to  identify  the  lower  level 
of  data  compression  slightly  improved  across  the  readings,  but  not  significantly.  The  majority  of  image  pairs,  which 
were  compressed  at  different  ratios,  were  rated  as  “clinically  equivalent”  for  use  in  a  screening  environment 
independent  of  whether  the  readers  selected  correctly  or  incorrectly  the  less  compressed  image. 

The images in our study were presented on separate, side-by-side monitors with magnification, pan, zoom, and
window/level features disabled. Permitting magnification and window/level adjustment may (or may not) have improved
discrimination. A similar 2-AFC study by Slone et al.17 (2000) evaluated wavelet and original JPEG compression of
posteroanterior chest digital radiographs and reported that image degradation was detected at compression levels greater
than  11:1  for  both  compression  methods.  At  a  compression  level  of  75:1  the  lower  compressed  image  was  correctly 
identified  approximately  95  %  of  the  time  for  both  the  wavelet  and  the  JPEG  compression  methods.  The  images  were 
presented  on  a  single  monitor,  and  the  readers  were  permitted  to  magnify  and  toggle  between  images,  which  they 
acknowledged  was  conservative  and  tested  the  reader’s  temporal  sensitivity. 

Since  radiologists  could  not  accurately  or  reliably  discriminate  non-compressed  and  highly-compressed  mammographic 
images,  their  interpretation  using  either  non-compressed  or  highly-compressed  images  is  not  likely  to  differ 
substantially.  We  also  note  that  diligent  monitor  calibration  may  be  critical  to  image  fidelity. 

OBJECTIVE. We assessed performance changes of a mammographic computer-aided
detection  scheme  when  we  restricted  the  maximum  number  of  regions  that  could  be  identified 
(cued)  as  showing  positive  findings  in  each  case. 

MATERIALS AND METHODS. A computer-aided detection scheme was applied to
500 cases (or 2,000 images), including 300 cases in which mammograms showed verified
malignant masses. We evaluated the overall case-based performance of the scheme using a
free-response receiver operating characteristic approach, and we measured detection sensitivity
at a fixed false-positive detection rate of 0.4 per image after gradually reducing the maximum
number of cued regions allowed for each case from seven to one.

RESULTS. The original computer-aided detection scheme achieved a maximum case-based
sensitivity of 97% at 3.3 false-positive detected regions per image. For a detection decision score
set at 0.565, the scheme had a 79% (237/300) case-based sensitivity, with 0.4 false-positive
detected regions per image. After limiting the number of maximum allowed cued regions per case,
the false-positive rates decreased faster than the true-positive rates. At a maximum of two cued
regions per case, the false-positive rate decreased from 0.4 to 0.21 per image, whereas detection
sensitivity decreased from 237 to 220 masses. To maintain sensitivity at 79%, we reduced the
detection decision score to as low as 0.36, which resulted in a reduction of false-positive
detected regions from 0.4 to 0.3 per image and a reduction in region-based sensitivity from
66.1% to 61.4%.

CONCLUSION.  Limiting  the  maximum  number  of  cued  regions  per  case  can  improve 
the  overall  case-based  performance  of  computer-aided  detection  schemes  in  mammography. 

Computer-aided detection systems are routinely used in a number of medical institutions around the
world to assist radiologists in the detection of abnormalities depicted on mammograms. The number of
mammograms scanned through commercial computer-detection systems has been rapidly increasing.
Although no general agreement has been reached on how computer-aided detection affects radiologists’
performance in terms of sensitivity and specificity [1-4], there are indications that the performance
of the computer-aided detection scheme itself has an impact on radiologists’ performance in detecting
abnormalities [5, 6], and observer confidence levels in accepting the cues generated by these systems
increase with higher performance levels of the scheme [7, 8]. Several commercial computer-aided
detection systems have been approved by the United States Food and Drug Administration, and the
relative performance levels of such systems have been compared [9, 10]. All commercial computer-aided
detection systems use specific threshold values to determine whether an identified suspicious region
is ultimately cued as a positive finding, and the performance of these systems is frequently evaluated
on the basis of the case-based sensitivity achieved at a given false-positive detection rate. In a
case-based (or a breast-based) analysis, sensitivity is based on the correct detection of at least one
true-positive region on either the craniocaudal or mediolateral oblique mammographic view or on
both [1].

Evaluation of computer-aided detection performance is not a simple matter. Previous studies have
shown that performance can vary widely depending on which scoring method is used, and there is no
general agreement on which scoring method should be used for this purpose [11, 12]. One study showed
that at approximately the same false-positive rate (e.g., 1.5 per image), the measured sensitivity
for the detection of microcalcification clusters ranged between 45% and 85% depending on which of
three different assessment methods was used [11].

In addition, computer-aided detection performance depends on the composition of the image database
used [13]. In general, computer-aided detection schemes may identify a large number of suspicious
regions on some images (e.g., images depicting dense tissue patterns), but only a few suspicious
regions on other images (e.g., images dominated by fatty tissue) [14]. Therefore, limiting the maximum
number of suspicious regions allowed to be cued for one case could potentially reduce the
false-positive rate with a relatively small decrease in sensitivity. This approach is used in
commercially available systems, but to the best of our knowledge, the effect of implementing the
approach on image- and case-based sensitivity and false-positive detection rates has not been
described in detail. This study was performed to assess this issue.

Materials  and  Methods 

We selected 500 cases (or 2,000 digitized mammograms) from a large image database available in
our laboratory. Among these cases, verified malignant masses were depicted in 300 cases, and the
remaining 200 were negative findings. In all cases with positive findings, a panel of radiologists
identified the locations of the mass regions on the images using the original diagnostic and biopsy
reports. The central coordinates (x and y) of each mass region were visually identified, marked, and
saved in a “truth file.” In this data set, mass regions were visible on both the craniocaudal and
mediolateral oblique mammographic views in 270 cases and were only visible on one of the two views
in 30 cases. Thus, 570 mass regions were identified on the images in this study. Figure 1 shows the
size distribution of the 300 masses in the data set.

A computer program determined the size of each mass region by counting the total number of pixels
inside the identified boundary contour of the region (multiplied by 0.0016 cm2 per pixel). The size
of a mass was represented by the larger computed area on either the craniocaudal or mediolateral
oblique mammogram. For each identified mass region, the panel of radiologists assigned a subjective
rating of subtlety using a 5-point rating scale that ranged from 1 (very easily visible) to 5 (very
subtly visible). Figure 2 shows the distribution of assigned subtlety ratings in this data set.
Subtlety of a mass was represented by the lower rating assigned to either the craniocaudal or
mediolateral oblique mammographic view. We verified all cases with negative (or benign) findings by
reviewing the available diagnostic information and the data from a follow-up examination with
negative results, confirming a minimum of one disease-free year.
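
The size and subtlety conventions described above can be illustrated with a short sketch. The pixel counts and ratings in the example are hypothetical; the only values taken from the text are the 0.0016 cm2 per pixel factor (which corresponds to a 400 x 400 µm pixel), the use of the larger area, and the use of the lower subtlety rating.

```python
PIXEL_AREA_CM2 = 0.0016  # area per pixel, as stated in the text


def region_area_cm2(pixel_count):
    """Area of one mass region from the number of pixels inside its contour."""
    return pixel_count * PIXEL_AREA_CM2


def summarize_mass(views):
    """Per-mass size and subtlety from its depictions on one or two views.

    `views` maps a view name to (pixel_count, subtlety_rating); the mass size
    is the larger area and the mass subtlety is the lower rating.
    """
    size = max(region_area_cm2(pixels) for pixels, _ in views.values())
    subtlety = min(rating for _, rating in views.values())
    return size, subtlety


# Hypothetical mass seen on both views.
example = {"craniocaudal": (1000, 3), "mediolateral oblique": (750, 2)}
size_cm2, subtlety = summarize_mass(example)
print(f"size = {size_cm2:.2f} cm2, subtlety = {subtlety}")
```

Because both summaries take the more conspicuous depiction of each mass, a per-region tabulation of the same data would show smaller and more subtle masses on average.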

A computer-aided detection scheme developed previously in our laboratory [15] was applied to the
2,000 images in the data set. Because we only examined computer-aided detection performance for mass
detection in this study, each image was first reduced by pixel averaging (a factor of 8 in both x and
y directions), increasing the effective pixel size from 50 × 50 µm in the original digitized image to
400 × 400 µm. The mass detection scheme then identified between 10 and 30 suspicious regions in each
image depending on the regional tissue patterns. For each identified region, a multilayer regional
growth algorithm [16] was applied to define the contours of the region as depicted in the image. If
the region met simple growth criteria, a set of features from the interior and surrounding background
of the region was computed by the scheme. Otherwise, the region was considered to have negative
findings and was deleted. Finally, a feature-based artificial neural network classified each
suspicious region as showing positive or negative findings by assigning a detection (or probability)
score. In a manner similar to the commercial computer-aided detection products, our detection scheme
identified a region as having a positive finding if the detection score exceeded a predetermined
threshold. If the detection score did not exceed the threshold, the region was not cued and was
considered to be a negative finding.
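
The pixel-averaging step can be sketched as a block mean over non-overlapping 8 x 8 neighborhoods. This is a minimal illustration, not the scheme's actual preprocessing code: the image is a plain list of rows, and dropping incomplete blocks at the right and bottom edges is an assumption made here for simplicity.

```python
def block_average(image, factor=8):
    """Reduce a 2-D image by averaging non-overlapping factor x factor blocks."""
    rows, cols = len(image), len(image[0])
    reduced = []
    # Assumption: partial blocks at the edges are discarded.
    for r in range(0, rows - rows % factor, factor):
        reduced_row = []
        for c in range(0, cols - cols % factor, factor):
            block = [image[r + i][c + j]
                     for i in range(factor) for j in range(factor)]
            reduced_row.append(sum(block) / len(block))
        reduced.append(reduced_row)
    return reduced


# Hypothetical 16 x 16 image whose pixel value equals its row index.
toy_image = [[float(r)] * 16 for r in range(16)]
small = block_average(toy_image, factor=8)
print(small)

# Effective pixel size after reduction: 50 µm x 8 = 400 µm.
effective_pixel_um = 50 * 8
```

Averaging by a factor of 8 in each direction reduces the pixel count by a factor of 64, which is consistent with restricting this study to mass detection rather than microcalcifications.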

After processing all images, we compared the regions with detected positive findings with the
results saved in the truth file. To determine whether a detected region was considered a
true-positive finding, we applied the following criterion: If the distance between the computed
center of a detected region and the visually marked coordinate on a mammogram was shorter than the
effective radius (the average radial length computed by the computer-aided detection scheme), the
region was considered to be a match to a true-positive mass. Otherwise, the region was considered a
false-positive case.
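
The matching criterion above amounts to a distance test against the effective radius. The sketch below uses hypothetical coordinates and a hypothetical function name; it assumes both points are expressed in the same pixel units as the radius.

```python
import math


def is_true_positive(detected_center, effective_radius, truth_center):
    """True if the marked truth coordinate lies within the region's effective radius."""
    dx = detected_center[0] - truth_center[0]
    dy = detected_center[1] - truth_center[1]
    # Strictly "shorter than" the effective radius, as stated in the text.
    return math.hypot(dx, dy) < effective_radius


# Hypothetical detected region centered at (100, 100) with effective radius 10.
print(is_true_positive((100, 100), 10.0, (103, 104)))  # distance 5
print(is_true_positive((100, 100), 10.0, (106, 108)))  # distance 10
```

A detection whose center sits exactly one effective radius from the marked coordinate does not count as a match under this strict inequality.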

To show the original performance of the computer-aided detection scheme when applied to this
data set, we plotted free-response receiver operating characteristic curves for both case-based and
region-based scores. In the case-based performance curve, sensitivity was assessed on the basis of
the correct marking of at least one true-positive region in either (or both) of the two mammographic
views, and if two regions were detected, the higher score was selected to represent the mass. In the
region-based performance curve, if the same mass was depicted on both craniocaudal and mediolateral
oblique views, we considered these two images to represent two independent regions.

Fig. 1. Bar graph shows size distribution of 300 masses depicted in data set. Mass size
is represented by larger depicted area on either craniocaudal or mediolateral oblique
mammographic view.

Fig. 2. Bar graph shows distribution of subjectively rated subtlety of 300 masses depicted
in data set. Subtlety of each identified mass was rated on 5-point scale, ranging
from 1 (very easily visible) to 5 (very subtly visible). Mass subtlety is represented by
lower-rated depiction on either craniocaudal or mediolateral oblique mammographic view.

We applied a threshold score to the artificial neural network results to evaluate the sensitivity
of the scheme at different false-positive rates. We also adjusted the threshold value to produce a
false-positive rate comparable to that of the leading commercial computer-aided detection systems
(e.g., a false-positive rate of 0.4 regions per image [2]). By changing the total number of cued
regions permitted in each case to anywhere from seven to one, we compared the change in performance
levels (including both sensitivity and false-positive rate). The scores generated by the artificial
neural networks for all detected regions were sorted by value from the highest to the lowest, and the
regions with higher scores were selected sequentially until the predetermined limit of cued regions
per case was reached. In addition, we kept the case-based sensitivity constant by reducing the
detection threshold and assessed the changes in false-positive rates and image-based sensitivity as
the total number of allowed cues per case was reduced from seven to two.
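
The cue-limiting rule described above (threshold the scores, sort from highest to lowest, keep at most a fixed number per case) can be sketched as follows. The `Region` class, the function names, and the two toy cases are hypothetical and serve only to illustrate the selection logic; four images per case follows from 2,000 images in 500 cases.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Region:
    score: float            # neural-network detection score of the region
    is_true_positive: bool  # True if the region matches a verified mass


def cue_regions(regions: List[Region], threshold: float,
                max_cues: Optional[int] = None) -> List[Region]:
    """Regions cued for one case: score above threshold, highest scores first."""
    above = [r for r in regions if r.score > threshold]
    above.sort(key=lambda r: r.score, reverse=True)
    return above if max_cues is None else above[:max_cues]


def evaluate(cases: List[List[Region]], threshold: float,
             max_cues: Optional[int] = None,
             images_per_case: int = 4) -> Tuple[int, float]:
    """Return (cases with at least one cued true positive, false positives per image)."""
    detected_cases = 0
    false_positives = 0
    for regions in cases:
        cued = cue_regions(regions, threshold, max_cues)
        if any(r.is_true_positive for r in cued):
            detected_cases += 1
        false_positives += sum(1 for r in cued if not r.is_true_positive)
    return detected_cases, false_positives / (len(cases) * images_per_case)


# Two hypothetical positive cases.
toy_cases = [
    [Region(0.90, False), Region(0.80, False), Region(0.70, False), Region(0.60, True)],
    [Region(0.95, True), Region(0.60, False), Region(0.58, False)],
]

print(evaluate(toy_cases, threshold=0.565))              # no limit on cues
print(evaluate(toy_cases, threshold=0.565, max_cues=2))  # at most two cues per case
```

In the first toy case the true mass scores above the threshold but ranks fourth, so a two-cue limit drops it along with one false positive. This is the mechanism by which limiting cues can remove some true-positive detections that a score threshold alone would keep.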

Results 

Figure 3 shows two computed free-response receiver operating characteristic curves after the
application of our computer-aided detection scheme to this data set. One is a case-based
free-response receiver operating characteristic performance curve; the other is a region-based
curve. Setting the threshold value of the artificial neural network detection scores at 0.565
generated a decision threshold line, as shown in Figure 3. At this level, the computer-aided
detection scheme identified 79% of the malignant masses with 0.4 false-positive regions per image
being cued. At this threshold, the scheme did not detect any false-positive regions in 33.2%
(166/500) of the cases.

Table 1 provides the performance levels of the computer-aided detection scheme when we limited
the maximum number of cued regions allowed in one case at this threshold level (0.565). The
false-positive detection rate decreased substantially faster than the case-based sensitivity. For
example, when we limited the maximum number of cued regions to two per case, the detection
sensitivity decreased by 7.2% (from 237/300 to 220/300 cases), whereas the false-positive detection
rate decreased by 47.3% (from 0.40 to 0.21 per image). In 65% of the true-positive cases, the region
with the highest artificial neural network score was the malignant mass region (Table 1).
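
The two percentages quoted in this paragraph are relative decreases, and they can be reproduced from the counts reported in Table 1. The sketch below is a simple arithmetic check using those counts (237 and 220 detected cases; 803 and 423 false-positive regions).

```python
def relative_decrease(before, after):
    """Relative decrease from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# No limit versus a maximum of two cued regions per case (Table 1).
sensitivity_drop = relative_decrease(237, 220)     # detected true-positive cases
false_positive_drop = relative_decrease(803, 423)  # false-positive regions

# The same false-positive comparison using the rounded per-image rates.
false_positive_drop_rounded = relative_decrease(0.40, 0.21)

print(f"sensitivity: -{sensitivity_drop:.1f}%")
print(f"false positives: -{false_positive_drop:.1f}% (from counts), "
      f"-{false_positive_drop_rounded:.1f}% (from rounded rates)")
```

The 47.3% figure follows from the region counts; the rounded per-image rates give a slightly different value. In absolute terms the case-based sensitivity falls from 79.0% to 73.3%, a difference of 5.7 percentage points.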

Figure 4 shows five free-response receiver operating characteristic curves generated when the
maximum allowed number of cues per case was limited to between seven and two. As the maximum number
of allowed cues was reduced, the free-response receiver operating characteristic curves tended to
become steeper. Table 2 summarizes the results after limiting the maximum number of cued regions and
changing the threshold value of the artificial neural network detection scores to maintain a 79%
case-based sensitivity. The table shows that we were able to reduce the false-positive rates while
maintaining a constant sensitivity. For example, by limiting the maximum allowed number of cues to
two per case and adjusting the artificial neural network threshold to 0.36, we reduced the
false-positive rate from 0.4 to 0.3 regions per image.

Fig. 3. Graph illustrates overall performance of computer-aided detection scheme when applied to database
of 2,000 mammograms (500 cases) with no limitation on number of cued regions. Detection decision threshold
line is represented by dotted line. ♦ = case-based free-response receiver operating characteristic curve, O =
image-based free-response receiver operating characteristic curve.

One interesting finding was that the 237 masses detected using these two scoring methods were not
identical: 17 of them differed. When the maximum number of cued regions was limited to two per case,
17 masses with artificial neural network scores higher than 0.565 (range, 0.57-0.77) were
eliminated. Reducing the threshold score to 0.36 resulted in the identification of 17 different
masses with artificial neural network scores in the range between 0.36 and 0.51. Figure 5 shows the
distribution of mass sizes and subtlety ratings of the 34 masses missed by both scoring methods. The
results suggest that the 17 masses that were detected only when the number of allowed cues was
limited to two per case and the threshold was lowered tended to be somewhat small. All 34 masses
were actually positive findings. At this time, the follow-up period on these patients has not been
long enough to assess the difference (if any) in clinical impact of the two approaches.

TABLE 1

Performance Levels of Computer-Aided Detection as a Function of the
Maximum Number of Cued Regions Allowed per Case

Maximum No. of              Sensitivity^a                       False-Positive Regions^b
Cued Regions          Case-Based        Region-Based
Allowed per Case      No.^c     %       No.^d     %             No.      Per-Image Rate
No limit              237       79.0    377       66.1          803      0.40
7                     237       79.0    376       66.0          795      0.40
5                     236       78.7    370       64.9          753      0.38
4                     233       77.7    364       63.9          695      0.35
3                     227       75.7    351       61.6          588      0.29
2                     220       73.3    316       55.4          423      0.21
1                     195       65.0    195       34.2          224      0.11

Note: Artificial neural network threshold value was set at 0.565.
a Detected true-positive cases.
b Detected false-positive regions.
c Cases.
d Regions.

Fig. 4. Graph shows five plots depicting free-response receiver operating characteristic curves
generated by different maximum numbers of cued regions allowed per case. The five plots correspond
to maximum numbers of cued regions of no limit, ≤ 7, ≤ 5, ≤ 3, and ≤ 2.

Discussion 

Case distributions and rating methods could have a significant effect on the evaluation of
computer-aided detection performance levels [11-13]. In this study, we tested a simple scoring
method that alters measured performance. The method of limiting the maximum number of cued regions
allowed per case is commonly used in commercial computer-aided detection products. However, the
actual scores for each region are not available to users. Therefore, several related issues, such as
the effect of this approach on overall performance and on the detection (or the missed detection) of
specific masses, have not, to our knowledge, been described in the past.

TABLE 2

Performance Levels of Computer-Aided Detection with Constant
Sensitivity of 79% as a Function of the Maximum Number of Cued Regions
Allowed per Case

Maximum No. of        Region-Based                                        Detection Decision Value
Cued Regions          Sensitivity^a       False-Positive Rate^b           of Artificial Neural
Allowed per Case      No.^c     %         No.      Per-Image Rate         Network Scores
No limit              377       66.1      803      0.40                   0.565
5                     371       65.1      773      0.39                   0.560
4                     378       66.3      902      0.45                   0.500
3                     375       65.8      781      0.39                   0.470
2                     350       61.4      604      0.30                   0.360

a Detected true-positive cases.
b Detected false-positive regions.
c Regions.

Our study showed that by limiting the maximum number of allowed regions to be cued in each case,
a substantial fraction of false-positive regions can be eliminated with only a small decrease in
sensitivity. If one wishes to maintain sensitivity, threshold values can be appropriately adjusted
for this purpose. Because most masses were visible on both the craniocaudal and mediolateral oblique
mammograms and because the detection performance of computer-aided detection systems is commonly
evaluated using case-based sensitivity, our results are quite encouraging. It appears that this
approach could reduce the false-positive detection rate of the scheme and possibly eliminate some
true-positive region-based detections while retaining the initial (unrestricted number of cues)
case-based sensitivity. Although the sensitivity can be maintained using this approach (changing the
threshold levels for detection), one does not detect exactly the same true-positive masses. We found
that limiting the maximum number of cues allowed per case and adjusting the threshold appropriately
increased computer-aided detection sensitivity in the subset of smaller masses. In general, this
effect is desirable in that it could reduce the number of regions that have to be ruled out by the
radiologist. We caution that the use of this approach may not yield improvements of similar magnitude
in the clinical environment with a substantially different distribution of truly positive and truly
negative cases.

Fig. 5. Scatterplot shows sizes and subtlety ratings distributions for 34 masses that were undetected by both
case-based and image-based scoring methods. ♦ = no limit to number of regions in each case that may be cued
as showing positive findings, X = maximum number of regions that may be cued is ≤ 2.

It should be noted that the size and subtlety ratings of masses in the data set were somewhat
conservative. In Figures 1 and 2, we used the larger of the sizes computed for a mass from the two
mammographic views and presented the less subtle rating for the same mass. Hence, distribution based
on image or region would show a somewhat smaller average mass size and a more subtle data set.

Only malignant masses were considered true-positive identifications in this study. In visually
assessing the false-positive regions with higher scores (e.g., > 0.7), we found that 19% (40/213) of
these regions represented well-defined benign masses (i.e., round benign masses with high contrast
and relatively sharp margins). Considering the detection of benign masses as either true-positive or
false-positive may have a substantial impact on the evaluation of computer-aided detection
performance levels. Because of the approach we used to reduce the number of cued regions per case and
because of the size and diversity of the data set used, we believe that our results are not unique
to our own computer-aided detection scheme.

Changes in Breast Cancer Detection and Mammography
Recall  Rates  After  the  Introduction  of  a  Computer- 
Aided  Detection  System 

David  Gur,  Jules  H.  Sumkin,  Howard  E.  Rockette,  Marie  Ganott,  Christiane 
Hakim,  Lara  Hardesty,  William  R.  Poller,  Ratan  Shah,  Luisa  Wallace 


Background: Computer-aided mammography is rapidly gaining clinical acceptance, but few data
demonstrate its actual benefit in the clinical environment. We assessed changes in mammography
recall and cancer detection rates after the introduction of a computer-aided detection system into a
clinical radiology practice in an academic setting. Methods: We used verified practice- and
outcome-related databases to compute recall rates and cancer detection rates for 24 Mammography
Quality Standards Act-certified academic radiologists in our practice who interpreted 115 571
screening mammograms with (n = 59 139) or without (n = 56 432) the use of a computer-aided detection
system. All statistical tests were two-sided. Results: For the entire group of 24 radiologists,
recall rates were similar for mammograms interpreted without and with computer-aided detection
(11.39% versus 11.40%; percent difference = 0.09, 95% confidence interval [CI] = -11 to 11;
P = .96) as were the breast cancer detection rates for mammograms interpreted without and with
computer-aided detection (3.49% versus 3.55% per 1000 screening examinations; percent difference
= 1.7, 95% CI = -11 to 19; P = .68). For the seven high-volume radiologists (i.e., those who
interpreted more than 8000 screening mammograms each over a 3-year period), the recall rates were
similar for mammograms interpreted without and with computer-aided detection (11.62% versus 11.05%;
percent difference = -4.9, 95% CI = -21 to 4; P = .16), as were the breast cancer detection rates
for mammograms interpreted without and with computer-aided detection (3.61% versus 3.49% per 1000
screening examinations; percent difference = -3.2, 95% CI = -15 to 9; P = .54). Conclusion: The
introduction of computer-aided detection into this practice was not associated with statistically
significant changes in recall and breast cancer detection rates, both for the entire group of
radiologists and for the subset of radiologists who interpreted high volumes of mammograms.
[J Natl Cancer Inst 2004;96:185-90]


A mounting body of evidence suggests that early detection of breast
cancer through periodic mammography screening reduces the morbidity
and mortality associated with this disease (1,2). Mammography
screening is rapidly gaining acceptance worldwide, and the number of
mammography procedures performed continues to increase (3,4).
However, mammography screening has a relatively low cancer detection
rate of only two to six cancers per 1000 mammograms after the first
2 years of screening (5).


The performance levels among radiologists who read and interpret
mammograms vary widely. Several factors may account for this
variability. These include, but are not limited to, the low incidence
of breast cancer, the difficulty in identifying suspicious (i.e.,
potentially malignant) regions in the surrounding breast tissue, and
the tedious and somewhat repetitious nature of the task of reading
mammograms (5-7).

In recent years, a major effort has been expended to develop
computer-aided detection systems to assist radiologists with the
diagnostic process. The hope is that these computer-aided detection
systems will improve the sensitivity of mammography without
substantially increasing mammography recall rates, in addition to
possibly decreasing inter-reader variability. These systems are
intended for the early detection of breast cancer and, accordingly,
are designed to assist the radiologist in the identification (i.e.,
detection) of suspicious regions (i.e., findings), such as clustered
microcalcifications and masses (8-10). Computer-aided diagnosis
(discrimination) systems are currently being developed to help
radiologists determine whether an identified suspicious region is
likely to represent a benign or a malignant finding (11-13).

The U.S. Food and Drug Administration (FDA) has approved several
computer-aided detection systems for clinical use, and Medicare and
many insurance companies have approved reimbursement for the use of
these systems in clinical practice. The initial FDA approval process
for these systems included retrospective interpretations of select
groups of cases in a laboratory environment (9,14,15). Results of
these studies (9,15) suggest that the use of computer-aided detection
systems can potentially increase cancer detection rates by
approximately 20% without substantially increasing recall rates.
However, there are only limited data on the impact of such systems
when used prospectively in a clinical environment (16-19). We used
large, prospectively ascertained databases to evaluate the recall and
cancer detection rates in our clinical breast imaging practice in an
academic setting for a 3-year period during which a computer-aided
detection system was introduced.

Methods 

Subjects  and  General  Procedures 

All screening mammography examinations performed in our facilities at
Magee-Womens Hospital of the University of Pittsburgh Medical Center
(Pittsburgh, PA) and its five satellite breast imaging clinics during
2000, 2001, and 2002 were included in this study. Our study was
carried out under an institutional review board-approved protocol.

The data sources for our analysis were databases that contained
information on procedure scheduling, procedure completion, radiology
reporting, and procedure-related outcomes as determined from relevant
pathology reports. These databases were assembled from the original
reports for quality assurance purposes, as required by the
Mammography Quality Standards Act (MQSA) (20), among other reasons.
The same computerized reporting system was in use throughout the
study period.

In the second quarter of 2001, we introduced a computer-aided
detection system (R2 Technologies, Los Altos, CA) into our clinical
practice at the main facility, where most of the screening
mammograms in our practice were read in batch mode. By the third
quarter of 2001, more than 70% of the screening mammograms were
interpreted with use of the computer-aided detection system. By the
fourth quarter of 2001, more than 80% of the screening mammograms
were interpreted with the assistance of the computer-aided detection
system. The radiologists in our practice could not select which
mammograms would be interpreted with or without the computer-aided
detection system. After training on the computer-aided detection
system was completed (June 2001), all screening mammograms
interpreted in our main facility were processed by and interpreted
with the assistance of the computer-aided detection system.
Radiologists at the five satellite clinics sometimes reviewed
screening mammograms if time allowed, but the number of these cases
was small, and there was no selection process that could bias the
analyses performed in this study. Knowing the schedule for
radiologists’ presence at the remote sites, we assembled a batch of
serially acquired mammograms for them to read in the same way they
would be read at the central facility, and those mammograms were
interpreted and reported in the same manner (with the exception of
the use of computer-aided detection). This set of mammograms was not
specifically selected because of suspicious findings by the
technologists. To reduce possible biases, an individual not involved
in this investigation was asked to examine summaries of
time-dependent recall rates for all radiologists in our practice for
the study period. A different team examined all cancers detected
throughout our practice as a result of screening mammography during
the same period.

During the study period, our practice performed a total of 115 571
screening examinations that were interpreted by 24 radiologists, 18
of whom interpreted more than 1000 mammograms each. All radiologists
were members of the Breast Imaging Section of the Department of
Radiology and would be considered breast imaging specialists in an
academic practice. We also repeated our analysis by using only data
for the seven highest volume radiologists, all of whom read more than
8000 mammograms each over a 3-year period. These seven radiologists,
who were with our institution throughout the study period, performed
the most readings, both with and without computer-aided detection
assistance.

For the purpose of computing recall rates, mammograms were considered
to be positive if recall for additional imaging evaluation was
recommended (i.e., mammograms classified as Breast Imaging Reporting
and Data System [BI-RADS] category 0) and negative if a 1-year
follow-up was recommended (i.e., mammograms classified as either
BI-RADS category 1 or 2) (21). Radiologists at these facilities did
not use BI-RADS assessment categories 3, 4, or 5 for screening
examinations. Positive outcome was defined as breast cancer detected
as a result of the diagnostic work-up initiated by a positive
screening mammogram.
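
The positive/negative rule above can be sketched in a few lines of Python. This is only an illustration of the stated definition; the function names and the example list of categories are hypothetical and are not taken from the study databases.

```python
def screening_result(birads_category: int) -> str:
    """Map a screening BI-RADS category to the positive/negative label
    described in the text (hypothetical helper, for illustration only)."""
    if birads_category == 0:
        return "positive"   # recall for additional imaging evaluation
    if birads_category in (1, 2):
        return "negative"   # routine 1-year follow-up recommended
    # Categories 3, 4, and 5 were not used for screening at these facilities.
    raise ValueError(f"unexpected screening BI-RADS category: {birads_category}")


def recall_rate_percent(categories) -> float:
    """Recall rate (%) = positive screening examinations / all examinations."""
    labels = [screening_result(c) for c in categories]
    return 100.0 * labels.count("positive") / len(labels)


# Made-up example: one recall among ten screening examinations.
example_rate = recall_rate_percent([0, 1, 1, 2, 1, 2, 1, 1, 2, 1])
```

Raising an error for categories 3 to 5 is a design choice of this sketch; it simply makes any record that violates the stated reporting convention visible.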

Computation  of  Mammography  Recall  Rates 

Recall rates for each radiologist and for the group of 24
radiologists were computed directly from mammographic interpretation
records. In all of our analyses, we excluded recommendations for
recall that were due to technical reasons, such as image artifacts
(<1%). Recalls due to palpable findings identified during clinical
breast examinations performed on all women by the technologist were
included in our analyses because the majority of these findings were
also marked on the mammograms. Such recalls amounted to approximately
1% of the screening examinations; hence, the underlying rates
attributable to mammography interpretations alone are approximately
1% lower than those reported here. The women in this group of recalls
are not the same as the group of women with palpable findings
discovered by the woman herself or by a physician during a breast
physical examination. Women in the latter group were scheduled for
diagnostic examinations and were not included in our study. In our
practice, palpable findings that are discovered by the technologists
are noted during the physical examination and the procedure continues
as a screening examination (including the use of computer-aided
detection). The interpreting radiologists are aware of the
technologists’ findings and recall the women for additional
procedures as needed. We recognize that this practice may not be a
common one. We assumed that the effects of recalling this group of
women due to palpable findings, if any, on the recall rates of
individual radiologists would be proportional to the overall volume
of mammograms read by each radiologist; hence, it should not
substantially affect the results.

A small percentage (<4%) of the examinations in our practice
classified as BI-RADS category 0 were scheduled for an interpretation
at a later date because the needed comparison films were missing
during the originally scheduled interpretation. Those cases were
distributed proportionally to the volume of mammograms read by each
radiologist and were included in the recall rates because it was not
clear how many of them would have been recalled anyway.

Each mammography examination was identified in our database as to
whether computer-aided detection was used during the interpretation.
We therefore analyzed the data according to whether cases were
interpreted with computer-aided detection.

Computation  of  Breast  Cancer  Detection  Rates 

Breast cancer detection rates were computed as follows: For every
breast cancer detected, we found the most recent screening mammogram
that identified a finding that led to a diagnostic follow-up and
ultimately resulted in a biopsy that was positive for cancer. Only
the interpreter of the original screening mammogram that led to the
detection of breast cancer (i.e., invasive carcinoma or ductal
carcinoma in situ) was credited with the finding. Findings of lobular
carcinoma in situ were not attributed to the interpreting radiologist
as a cancer detected in the analyses. If a woman was recommended for
a biopsy directly as a result of the screening examination, the
interpreter was credited with the finding as well. Cases were
excluded from the analysis if the most recent screening mammogram
prior to biopsy had been performed more than 180 days before the
biopsy or if the original interpreter had not recommended a recall
(i.e., false-negative cases). We chose a cutoff of 180 days because
we have found that, in the vast majority of cases, women are lost to
follow-up or ignore the recall recommendation altogether if the
recommended follow-up diagnostic procedure is not scheduled within 90
days or performed within 180 days of the original mammogram. We
attributed any subsequent findings associated with recalls for
diagnostic work-ups that did not take place within 180 days of the
original mammogram to the subsequent examination. We included all
examinations that had been originally scheduled as screening
procedures but were converted into diagnostic procedures during the
same visit and resulted in a positive outcome (i.e., a finding of
cancer). However, these cancer cases (n = 30) were excluded from the
computed breast cancer detection rates in our analysis (both
numerator and denominator) because they were all diagnosed by a
radiologist without the use of computer-aided detection, and we
therefore could not determine whether these cases would have been
detected had they undergone routine interpretation (with or without
computer-aided detection) as a routine screening procedure. In
addition, all breast cancer patients who were referred to us from
other facilities and for whom the diagnosis did not originate from a
screening examination done at one of our facilities were excluded
from the analysis.
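
The crediting rule described above can be illustrated with a short Python sketch. The function name, argument names, and histology labels are hypothetical; the sketch covers only the histology, recommendation, and 180-day criteria, and it ignores the separate exclusions for converted and referred cases.

```python
from datetime import date

MAX_DAYS_SCREEN_TO_BIOPSY = 180           # cutoff stated in the text
CREDITED_HISTOLOGIES = {"invasive", "DCIS"}  # lobular carcinoma in situ is not credited


def is_credited(screen_date: date, biopsy_date: date,
                follow_up_recommended: bool, histology: str) -> bool:
    """Return True if a biopsy-proven cancer is credited to the interpreter
    of the most recent screening mammogram (illustrative assumption-based rule)."""
    if histology not in CREDITED_HISTOLOGIES:
        return False
    if not follow_up_recommended:
        return False  # no recall or biopsy recommended: a false-negative case
    elapsed_days = (biopsy_date - screen_date).days
    return 0 <= elapsed_days <= MAX_DAYS_SCREEN_TO_BIOPSY


# Made-up dates: a biopsy exactly 180 days after screening is still credited,
# whereas one performed 181 days after screening is not.
credited_at_cutoff = is_credited(date(2001, 1, 1), date(2001, 6, 30), True, "invasive")
credited_after_cutoff = is_credited(date(2001, 1, 1), date(2001, 7, 1), True, "invasive")
```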

Statistical  Methods 

Recall and detection rates with and without computer-aided detection
were compared by using a generalized estimating equations (GEE)
logistic regression model that accounts for clustering of findings
within each reader (22). In addition, we asked an independent team of
investigators to evaluate the numbers of cancer cases that were
detected with and without computer-aided detection by the type of
abnormality or abnormalities noted in the original report. Those
findings were assigned to one of the following categories: 1)
mass(es) only; 2) clustered microcalcifications only; 3) mass(es) and
clustered microcalcifications; and 4) other findings. Because the
performance levels of computer-aided detection systems are generally
outstanding for detecting microcalcifications (16), we used the GEE
model to analyze our findings with respect to possible changes in the
percentage of cancer detections attributable to microcalcification
clusters associated with the use of computer-aided detection. In
addition, all analyses were repeated using a mixed-effect logistic
regression model in which readers were considered a random effect,
and modality (i.e., with or without computer-aided detection) was
considered a fixed effect (23). We also examined data from the seven
high-volume radiologists (i.e., those who interpreted more than 8000
mammograms each during the study period). Because of the serial
nature of the analysis (namely, this was not a randomized study), we
repeated the analyses with respect to the timing of the major use of
computer-aided detection in our practice by comparing the results for
all cases interpreted without computer-aided detection from January
1, 2000, through June 30, 2001, when computer-aided detection was
used in only a small percentage of cases (<0.2%) at our facilities,
with results for all cases interpreted with computer-aided detection
from October 1, 2001, through December 31, 2002, when most (>93%) of
the cases at our facilities were interpreted with computer-aided
detection. All statistical tests were two-sided.
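
As a rough cross-check only, the pooled recall counts reported in Table 1 can be compared with an ordinary two-sided two-proportion z-test. This is not the GEE model used in the study: it ignores the clustering of readings within radiologists, so its P value need not match the published one. The function name is hypothetical.

```python
import math


def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int):
    """Naive two-sided z-test for a difference between two proportions.

    Unlike the GEE logistic model described in the text, this treats every
    examination as independent (no reader clustering)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    z = (p2 - p1) / se
    # Two-sided P value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Recalls without CAD (6430 of 56 432) versus with CAD (6741 of 59 139).
z_recall, p_recall = two_proportion_z_test(6430, 56432, 6741, 59139)
```

Under these assumptions the test statistic is close to zero, which agrees in direction with the reported lack of a statistically significant difference in recall rates.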

Results 

The  mean  age  of  the  screened  population  (n  =  115  571) 
during  the  study  period  was  50.05  years  (standard  deviation  = 
11.17  years).  During  the  study  period,  the  percentage  of  women 
who  were  screened  for  the  first  time  gradually  decreased  from 
approximately  40%  in  2000  to  30%  in  the  last  quarter  of  2002, 
whereas  the  percentage  of  women  who  had  repeated  screenings 
gradually  increased. 

Table 1 summarizes our data for the 24 radiologists who interpreted
screening mammograms at our facility with and without the use of a
computer-aided detection system. Among the 115 571 examinations in
our database, 56 432 (48.8%) were interpreted without the use of the
computer-aided detection system and 59 139 (51.2%) were interpreted
with the use of the computer-aided detection system. Recall rates for
the entire group of 24 radiologists were 11.39% for mammograms
interpreted without computer-aided detection and 11.40% for
mammograms interpreted with it (percent difference = 0.09, 95%
confidence interval [CI] = -11 to 11; P = .96). Recall rates for the
18 radiologists who interpreted more than 1000 mammograms each during
the study period ranged from 7.7% to 17.2% (data not shown). Recall
rates for the seven high-volume radiologists who interpreted more
than 8000 mammograms each during the study period ranged from 7.7% to
14.9% (data not shown). Among this latter group of radiologists,
there was no statistically significant correlation (rho = -0.21,
P = .64) between recall rate and the total number of screening
mammograms interpreted by individual radiologists. In our practice,
approximately 3.0% of the cases recommended for recall are typically
lost to follow-up because the woman either undergoes re-screening at
another institution or ignores our recommendations. This group
remained relatively constant as a percentage of recalled women over
the period in question.

Table 1. Mammography recall rates and breast cancer detection rates
for 24 radiologists performing screening mammograms without and with
computer-aided detection*

                                  No. of           No. of    No. of breast     Recall    Breast cancer detection rate
Type of interpretation            mammograms read  recalls   cancers detected  rate, %   per 1000 mammograms read
Without computer-aided detection  56 432           6430      197               11.39     3.49
With computer-aided detection     59 139           6741      210               11.40     3.55
Total                             115 571          13 171    407               11.40     3.52

*The analysis excluded 30 conversion (screening to diagnostic) cancer
cases, all of which were interpreted without computer-aided detection.
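
The rates in Table 1 follow directly from the counts. The short Python sketch below recomputes them; the dictionary layout and function names are illustrative choices, while the counts are those reported in the table.

```python
# Counts reported in Table 1 (mammograms read, recalls, cancers detected).
TABLE_1 = {
    "without CAD": {"read": 56432, "recalls": 6430, "cancers": 197},
    "with CAD": {"read": 59139, "recalls": 6741, "cancers": 210},
}


def recall_rate_pct(recalls: int, read: int) -> float:
    """Recall rate as a percentage of mammograms read."""
    return 100.0 * recalls / read


def detection_rate_per_1000(cancers: int, read: int) -> float:
    """Breast cancers detected per 1000 mammograms read."""
    return 1000.0 * cancers / read


rates = {
    arm: (
        round(recall_rate_pct(c["recalls"], c["read"]), 2),
        round(detection_rate_per_1000(c["cancers"], c["read"]), 2),
    )
    for arm, c in TABLE_1.items()
}
# rates["without CAD"] is (11.39, 3.49) and rates["with CAD"] is (11.4, 3.55),
# matching the recall and detection columns of the table.
```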

Table 2 summarizes our data for the seven high-volume radiologists
who interpreted more than 8000 screening mammograms each with and
without the use of a computer-aided detection system. During the
study period, these radiologists interpreted a total of 82 129
screening mammograms and were credited with the detection of 292
breast cancers as a result of these screening procedures. In this
group, the recall rates decreased from 11.62% for mammograms
interpreted without computer-aided detection to 11.05% for
mammograms interpreted with computer-aided detection (percent
difference = -4.9, 95% CI = -21 to 4; P = .16).

Breast cancer detection rates for the entire group of 24 radiologists
were 3.49 per 1000 screening examinations for mammograms interpreted
without computer-aided detection and 3.55 per 1000 screening
examinations for mammograms interpreted with it (percent difference =
1.7, 95% CI = -11 to 19; P = .68) (Table 1). Breast cancer detection
rates for the seven high-volume radiologists were 3.61 per 1000
screening examinations for mammograms interpreted without
computer-aided detection and 3.49 per 1000 screening examinations for
mammograms interpreted with computer-aided detection (percent
difference = -3.2, 95% CI = -15 to 9; P = .54) (Table 2).
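
The percent differences quoted above can be approximated from the reported rates, assuming they express the change relative to the rate without computer-aided detection. The text does not state the exact formula, so this is an assumption. Because the inputs here are rounded rates, small discrepancies from the published values are expected (for example, about -3.3 rather than -3.2 for the high-volume detection rates).

```python
def percent_difference(rate_without_cad: float, rate_with_cad: float) -> float:
    """Relative change (%) from the without-CAD rate to the with-CAD rate
    (assumed definition; not spelled out in the text)."""
    return 100.0 * (rate_with_cad - rate_without_cad) / rate_without_cad


# Rounded rates as reported in the text.
comparisons = {
    "recall, all 24 radiologists": percent_difference(11.39, 11.40),
    "detection, all 24 radiologists": percent_difference(3.49, 3.55),
    "recall, seven high-volume": percent_difference(11.62, 11.05),
    "detection, seven high-volume": percent_difference(3.61, 3.49),
}
```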

The cancer detection rates associated with recalls due to the
detection of clustered microcalcifications alone were 1.35 per 1000
mammograms interpreted without computer-aided detection and 1.44 per
1000 mammograms interpreted with computer-aided detection (P = .66)
(data not shown). We observed no trend in breast cancer detection
rates over time when we reviewed average detection rates for all 24
radiologists by calendar quarter (data not shown). We repeated our
analyses using a random-effects logistic regression model and found
no statistically significant changes in recall rates or detection
rates for any of the measurements presented above. Our results were
not substantially affected when we compared only mammograms
interpreted without computer-aided detection prior to July 1, 2001,
with only those interpreted with computer-aided detection after
October 1, 2001.

Discussion 

The introduction of computer-aided detection into our practice was
not associated with statistically significant changes in recall and
breast cancer detection rates for the entire group of radiologists as
well as for the subset of seven radiologists who interpreted high
volumes of mammograms. The magnitudes of the improvements we observed
were substantially less than those reported in the literature as the
range of possible improvements based on retrospective analyses and
limited prospective data (9,17,18). The improvements we observed may
be attributable to the better detection of clustered
microcalcifications associated with malignancy. Our findings are
consistent with the range of improvement in detection rates estimated
and reported by others (9,16-18). However, our large confidence
intervals reflect the relatively low number of breast cancers
detected with and without computer-aided detection and the large
inter-reader variability among the radiologists in our practice.
Because there were no repeat measures in this database (that is, each
of the examinations was interpreted only once by one radiologist), we
could not assess intra-reader variability.

It should be noted that we could not provide detailed information for
individual radiologists without providing individually traceable data
because each staff radiologist knows his or her reading volume and
approximate recall rate. Our data are not adjusted for any learning
effect: namely, the majority of interpretations made without
computer-aided detection occurred chronologically prior to those made
with computer-aided detection. We also did not account for any effect
that may have resulted from a continuous effort to improve
performance (in particular, sensitivity) by group reviews of all
false-negative cases or from the steps undertaken to reduce recall
rates through various actions, such as monthly performance reviews
and direct consultation with interpreters who had higher-than-average
recall rates.

Although one could argue that some or all of the reduction in recall
rates we observed for the high-volume radiologists may be
attributable to the use of computer-aided detection, the
corresponding decrease in cancer detection rates we observed among
the radiologists in this group is not easily explained by expected
practice variations. An assessment of whether the small improvement
we observed in cancer detection is due to learning effects (namely,
that our radiologists had substantially more overall experience
interpreting mammograms without computer-aided detection than with
computer-aided detection) is beyond the scope of this investigation.

This investigation covered a period during which conventional film
mammography was performed in all of our screening procedures. Hence,
we cannot comment on the possible effect of computer-aided detection
in a digital mammography environment. In our study, we did not
account for women who had decided to follow up on our recommendations
elsewhere. However, because compliance in patient follow-up was
relatively constant during the study period, any bias in the results
due to changes in patient loss to follow-up is likely to be small.

Table 2. Mammography recall rates and breast cancer detection rates
for the seven high-volume radiologists performing screening
mammograms without and with computer-aided detection

                                  No. of           No. of    No. of breast     Recall    Breast cancer detection rate
Type of interpretation            mammograms read  recalls   cancers detected  rate, %   per 1000 mammograms read
Without computer-aided detection  44 629           5188      161               11.62     3.61
With computer-aided detection     37 500           4145      131               11.05     3.49
Total                             82 129           9333      292               11.36     3.56

There are limited reported data concerning the actual effect of
computer-aided detection on breast cancer detection and mammography
recall rates. The prospective data reported by Freer and Ulissey
(16), which suggested a substantial improvement (19.5%) in breast
cancer detection rates associated with the use of computer-aided
detection systems, may have been affected by the fact that the
results of mammographic interpretations without and with
computer-aided detection were reported on the same cases (i.e.,
mammograms were read in one sitting, first without computer-aided
detection then immediately afterward with the use of a computer-aided
detection system). Another prospective study performed in a similar
manner reported a 12% improvement in detection rates associated with
the use of a computer-aided detection system (18). This type of
protocol, namely reading mammograms without computer-aided detection
followed immediately by readings of the same mammograms with the use
of a computer-aided detection system and a reassessment of the
original finding without computer-aided detection, may have
introduced a lower level of vigilance among radiologists during the
initial interpretation without computer-aided detection, because they
knew that computer-aided detection would be available to them for the
final recommendation and that the initial interpretation did not
constitute a formal clinical recommendation.

Results of the only study similar to ours, albeit on a substantially
smaller group of patients and under a different set of circumstances,
suggested that computer-aided detection was associated with a 13%
improvement in breast cancer detection rates (17). One of the
advantages of the approach taken in our investigation is that the
radiologists’ interpretations were performed and recorded
prospectively in a clinical setting and data were collected primarily
for quality-assurance purposes (24).

Our results for the interpretations made with computer-aided
detection may be marginally biased because the outcomes of as many as
nine recommendations for recalls and three recommendations for
biopsies during the last quarter of 2002 are not yet available.
Although some of these follow-up procedures or biopsies may
ultimately be performed at our institution, we assume that the women
who underwent the original mammograms have been lost to follow-up.
However, on the basis of our typical recall-to-cancer-detection
ratios (approximately 1 of 32 cases) and biopsy-to-confirmed-cancer
ratios (approximately 1 of 5 cases), we suspect that this bias would
not substantially affect our findings or conclusions. It is possible
that the gradually increasing fraction of women who had prior
screening examinations created a bias in our results. Repeat
screening examinations have a slightly lower number of cancers
present as more are detected during the first screen, and on average,
cancers detected on repeat mammograms may be more “difficult” to
detect because more of the “easier” (e.g., larger) cancers are
detected during the initial screen. Repeat mammograms have a lower
recall rate, as the radiologists have prior films for comparison to
help inform their decision. The availability of prior examinations
for comparison (in the repeat examinations) should have aided in the
interpretation of these mammograms and offset the possible effect (if
any) on the interpretations due to an increase in the “average case
difficulty.” The fact that our recall rates and detection rates
remained virtually constant over time suggests that the possible bias
due to a gradual increase in repeat examinations is not a
statistically significant factor. We suspect that this increasing
availability of prior examinations for comparison is a general
phenomenon that is observed by most mammography screening practices
and that there is not a simple way to account for it in an analysis
such as the one we performed. When we included the 30 examinations
that had been originally scheduled as screening procedures but were
diagnosed during the same visit and resulted in a positive outcome in
the estimation, our actual cancer detection rate attributable to
screening was 3.8 per 1000 examinations, which is reasonable for a
population in which the majority of women had undergone several
screening procedures prior to the study period (19).

On  the  basis  of  published  performance  levels  of  other 
computer-aided  detection  systems  (25),  we  believe  that  our 
results  are  not  unique  to  the  specific  computer-aided  detection 
system  that  is  used  at  our  institution.  It  is  possible,  however,  that 
in  clinical  practices  with  substantially  lower  recall  rates  than 
ours,  computer-aided  detection  would  have  larger  effects  on 
mammography  recall  rates  and  detection  rates  than  what  we 
observed.  Such  an  improvement  in  detection  rates  would  be 
consistent  with  results  of  a  study  (17)  that  reported  lower  recall 
rates  without  computer-aided  detection  (8.02%)  than  with 
computer-aided  detection  (8.43%). 

The  financial  implications  of  our  findings  are  beyond  the 
scope  of  this  work.  However,  a  simple  assessment  of  the 
additional  estimated  cost  of  using  computer-aided  detection 
per  additional  cancer  detected  in  our  practice  (approximately 
$150  000  per  additional  detected  cancer,  assuming  a  reim¬ 
bursement  rate  of  $10  per  case  for  professional  and  technical 
components  combined)  clearly  indicates  that  more  rigorous 
evaluations  of  the  cost  effectiveness  of  this  practice  are  needed. 
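
The arithmetic behind this estimate can be sketched as follows. The only inputs taken from the text are the $10 reimbursement per case and the approximately $150 000 cost per additional detected cancer; the function names are illustrative, and the implied incremental detection rate is back-calculated from those two figures rather than reported by the authors.

```python
def cost_per_additional_cancer(fee_per_case, extra_cancers_per_1000):
    """Added cost of CAD per additional cancer detected.

    fee_per_case: added reimbursement per examination, in dollars.
    extra_cancers_per_1000: additional cancers detected per 1000 examinations.
    """
    if extra_cancers_per_1000 <= 0:
        raise ValueError("no additional cancers detected; cost is unbounded")
    exams_per_extra_cancer = 1000.0 / extra_cancers_per_1000
    return fee_per_case * exams_per_extra_cancer


def implied_extra_cancers_per_1000(fee_per_case, cost_per_cancer):
    """Incremental detection rate implied by a fee and a cost per cancer."""
    return 1000.0 * fee_per_case / cost_per_cancer


# Figures quoted in the text: $10 per case, about $150 000 per cancer.
implied = implied_extra_cancers_per_1000(10.0, 150_000.0)
round_trip = cost_per_additional_cancer(10.0, implied)
print(f"implied gain: {implied:.3f} cancers per 1000 examinations")
print(f"cost per additional cancer: ${round_trip:,.0f}")
```

Under these assumptions, the quoted cost corresponds to roughly one additional cancer per 15 000 examinations, which illustrates how sensitive the estimate is to small changes in the incremental detection rate.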

Our  observations  with  respect  to  recall  and  detection  rates 
may  be  exceptions  (stemming  from  large  inter-practice  varia¬ 
tions)  that  highlight  the  need  for  additional  recall  and  detection 
rate  data  from  multiple  clinical  practices  and  different  reading 
environments.  However,  until  such  data  clearly  demonstrate  that 
our  experience  is  indeed  an  exception,  these  results  represent  an 
important  first  step. 

This  analysis  of  our  practice  was  designed  to  assess  the 
changes,  if  any,  that  occurred  in  recall  and  breast  cancer  detec¬ 
tion  rates  with  the  introduction  of  computer-aided  detection.  Our 
results  suggest  that,  in  our  practice,  neither  recall  rates  nor  breast 
cancer  detection  rates  changed  with  the  introduction  of  this 
technology  at  its  current  level  of  performance,  particularly  as 
related  to  the  detection  of  abnormalities  other  than  clustered 
microcalcifications.  Due  to  large  confidence  intervals,  our  re¬ 
sults  are  statistically  consistent  with  the  possibility  of  large 
improvements  in  cancer  detection  rates  with  computer-aided 
detection.  Yet,  actually  observed  changes  in  our  practice  were 
substantially  lower  than  expected.  This  is  not  to  say  that  the  use 
of  computer-aided  detection  would  not  be  beneficial  or  cost- 
effective  in  other  practices.  Rather,  we  suggest  that,  at  its  current 
level  of  performance,  computer-aided  detection  may  not  im¬ 
prove  mammography  recall  or  breast  cancer  detection  rates 
(especially  as  related  to  the  detection  of  masses)  in  academic 
practices  similar  to  ours  that  employ  specialists  for  interpreting 
screening  mammograms. 

BACKGROUND. The authors investigated the correlation between recall and detection
rates in a group of 10 radiologists who had read a high volume of screening
mammograms in an academic institution.

METHODS.  Practice-related  and  outcome-related  databases  of  verified  cases  were 
used to compute recall rates and tumor detection rates for a group of 10 Mammography
Quality Standards Act (MQSA)-certified radiologists who interpreted a
total  of  98,668  screening  mammograms  during  the  years  2000,  2001,  and  2002.  The 
relation  between  recall  and  detection  rates  for  these  individuals  was  investigated 
using  parametric  Pearson  (r)  and  nonparametric  Spearman  (rho)  correlation  co¬ 
efficients.  The  effect  of  the  volume  of  mammograms  interpreted  by  individual 
radiologists  was  assessed  using  partial  correlations  controlling  for  total  reading 
volumes. 

RESULTS.  A  wide  variability  of  recall  rates  (range,  7.7-17.2%)  and  detection  rates 
(range,  2.6-5.4  per  1000  mammograms)  was  observed  in  the  current  study.  A 
statistically significant correlation (P < 0.05) between recall and detection rates
was  observed  in  this  group  of  10  experienced  radiologists.  The  results  remained 
significant  (P  <  0.05)  after  accounting  for  the  volume  of  mammograms  interpreted 
by  each  radiologist. 

CONCLUSIONS.  Optimal  performance  in  screening  mammography  should  be  eval¬ 
uated  quantitatively.  The  general  pressure  to  reduce  recall  rates  through  "practice 
guidelines"  to  below  a  fixed  level  for  all  radiologists  should  be  assessed  carefully. 
Cancer  2004;100:1590-4.  ©  2004  American  Cancer  Society. 

KEYWORDS:  mammography,  screening,  tumor  detection  rates,  recall  rates. 

As  periodic  mammographic  screening  is  rapidly  gaining  accep¬ 
tance,  our  understanding  of  many  strategic,  operational,  and  fi¬ 
nancial  issues  related  to  this  practice  is  improving  as  well.  Several 
performance  indices  have  been  used  to  define  “optimal”  practice 
parameters  in  screening  mammography.  These  include,  but  are  not 
limited  to,  sensitivity,  specificity,  positive  predictive  value  (PPV),  and 
cost  per  detected  tumor.1,2  Clearly,  the  focus  of  screening  for  early 
detection  should  primarily  be  on  improved  sensitivity.  At  the  same 
time,  the  large  number  of  patients  being  recalled  for  additional  pro¬ 
cedures  as  a  result  of  an  initial  review  is  a  recognized  problem  for  the 
very  same  reasons  (operational  and  financial),  with  the  added  con¬ 
cern  of  the  well  documented  increase  of  anxiety  levels  in  women  who 
are  recalled.3,4  Therefore,  there  is  a  belief  that  through  a  variety  of 
actions  including  but  not  limited  to  specific  and  targeted  training,  one 
can  augment  observer  performance  levels,  including  the  reduction  of 
recall  rates  in  screening  mammography.5,6  Although  not  specifically 
regulated, there is a publicly stated goal to reduce
recall levels to < 10%.5,7 The question of what effect, if
any, a forced reduction in recall rates has on
detection rates remains somewhat controversial.
Some  studies  suggest  that  recall  and  detection  rates 
are  not  highly  correlated  (particularly  at  high  recall 
rates);  hence,  a  reduction  in  the  former  does  not  nec¬ 
essarily  affect  the  latter.6,8  Other  researchers  believe 
that,  after  appropriate  training,  highly  experienced  ra¬ 
diologists  individually  operate  largely  along  a  single 
receiver  operating  characteristic  curve;  hence,  pres¬ 
suring  them  to  reduce  their  recall  rate  may  result  in  a 
corresponding  reduction  in  the  detection  rates  as 
well.2,9  Because  of  the  well  documented  variability 
among  radiologists,  the  latter  effect  and  its  possible 
magnitude  have  to  our  knowledge  been  investigated 
only recently.10-13 This type of investigation is not
easy  to  perform,  because  the  expected  yield  (detection 
of  actually  positive  cases  that  result  from  the  screen¬ 
ing)  has  been  reported  to  be  quite  low  in  a  population 
of  women  who  already  have  been  screened  in  the 
past.14,15  Therefore,  one  generally  needs  to  evaluate 
detection  rates  from  the  data  of  large  groups  of  indi¬ 
vidual  radiologists  pooled  together  or  have  access  to 
sufficient  data  from  radiologists  who  each  have  inter¬ 
preted  a  large  number  of  mammograms.  In  this  arti¬ 
cle,  we  present  an  analysis  of  the  latter  type  of  inves¬ 
tigation. 

MATERIALS  AND  METHODS 

Screening  mammography  examinations  performed  in 
the  study  facilities  at  Magee-Womens  Hospital  (of  the 
University  of  Pittsburgh  Medical  Center)  and  its  five 
satellite  breast  imaging  clinics  during  the  years  2000, 
2001,  and  2002  were  reviewed  under  an  Institutional 
Review  Board-approved  protocol.  Mammograms  that 
had been interpreted by the 10 highest volume
mammographers at the study institution during this period
were  included  in  the  current  study. 

The  data  sources  used  in  the  current  analysis  were 
databases  of  procedure  scheduling,  procedure  com¬ 
pletion,  radiology  reporting,  and  procedure-related 
outcomes  as  determined  from  pathology  reports. 
These  databases  have  been  assembled  from  original 
reports  for  several  reasons,  including  quality  assur¬ 
ance  purposes  that  are  required  by  the  Mammography 
Quality Standards Act (MQSA).16,17 The computerized
reporting  system  and  data  entry  protocols  used  in  our 
practice remained the same throughout the study period.
Because the number of positive findings leading
to the detection of tumors by each individual was
low, the records of all mammograms read by each of
the participating radiologists “with” and “without” the
availability of results from a commercial Computer-Assisted
Detection (CAD) system were pooled for the
purpose of this analysis. Our clinical practice for
screening  mammography  during  this  period  was  film 
based,  and  most  screening  mammograms  were  read  at 
the  main  facility  in  a  batch  mode.  We  included  in  the 
current  analysis  the  results  from  the  interpretations  of 
the  10  highest  volume  radiologists  in  our  practice, 
most  of  whom  were  with  the  study  institution 
throughout  much  of  the  period  in  question.  Each  has 
performed  >  3500  interpretations  of  screening  mam¬ 
mography  examinations. 

Recall  rates  for  each  radiologist  were  computed 
directly  from  mammography  interpretation  records 
(Breast Imaging Reporting and Data System Atlas [BI-RADS®
Atlas; American College of Radiology, Reston,
VA] rating of 0). We excluded recommendations for
recall due to technical reasons (“technical recalls”).
These  account  for  approximately  1%  of  cases.  How¬ 
ever,  recalls  resulting  from  palpable  findings  during 
clinical  breast  examinations  were  included  because 
the  majority  of  these  findings  also  were  depicted  in  the 
mammograms.  These  findings  amount  to  <  1%  of 
examinations;  therefore,  the  underlying  rates  attribut¬ 
able  to  mammography  interpretations  alone  are  ac¬ 
cordingly  somewhat  lower  than  those  reported  in  the 
current  study.  The  effect  of  "palpable”  findings  on 
individual  radiologists  is  expected  to  be  distributed 
proportionally  to  their  overall  volume. 

In  our  practice,  the  interpretation  of  some  exam¬ 
inations  (<  4%)  is  delayed  because  of  missing  com¬ 
parison  films  during  the  initial  interpretation.  These 
generally  are  distributed  proportionally  to  the  volume 
read  by  each  radiologist  and  are  included  in  the  recall 
rates  because  it  is  not  clear  how  many  of  these  cases 
would  have  been  actually  recalled  in  any  case. 

Tumor  detection  rates  were  computed  as  follows. 
We  identified  the  latest  screening  examination  for 
each  detected  tumor  that  resulted  in  a  diagnostic  fol¬ 
low-up  (recall)  and  ultimately  resulted  in  pathologi¬ 
cally  verified  carcinoma.  The  radiologist  who  inter¬ 
preted  the  screening  mammogram  that  led  to  the 
detection  of  breast  carcinoma  was  credited  with  the 
finding  for  the  purposes  of  the  current  analysis.  Cases 
were  excluded  from  the  analysis  if  the  latest  screening 
mammogram  prior  to  biopsy  had  been  performed 
>180  days  earlier.  In  our  experience,  these  women 
generally  are  "lost”  to  follow-up  at  other  institutions 
or  ignore  the  recommendations  for  a  diagnostic 
workup  (recall)  altogether.  Cancer  patients  who  were 
referred  to  us  from  other  facilities  and  for  whom  the 
diagnosis  did  not  originate  from  a  screening  examina¬ 
tion  in  one  of  our  facilities  were  excluded.  Women 
who originally presented for screening procedures
but were diagnosed using additional radiographic
procedures or other modalities (e.g., ultrasound)
during the same visit (“conversion” cases from
screening  to  diagnostic)  were  accounted  for  and  were 
included  in  the  current  analysis.  However,  because  a 
substantial  number  of  these  may  originally  have  been 
identified  as  “potentially  abnormal”  by  a  technologist 
(who  personally  shows  the  case  to  a  radiologist)  dur¬ 
ing  a  quality  assurance  review  of  the  images,  we  re¬ 
peated  the  analysis  after  excluding  this  group  of  cases. 
For the purpose of these analyses, we assume that any
effect due to the performance level of the radiologists
who were performing and interpreting the diagnostic
procedures during the follow-up visit is distributed
in a manner that does not affect the study conclusions.
The radiologists could not select the examinations
they interpreted in our practice.

The  correlation  between  recall  and  detection  rates 
was  evaluated  using  both  the  parametric  Pearson  (r) 
and  the  nonparametric  Spearman  (rho)  correlation 
coefficients. We also examined the results using partial
correlations that controlled for the total volume of mammograms
interpreted by each radiologist during the period in
question.
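
The three statistics named above can be sketched in a few lines. This is a minimal illustration, not the authors' analysis code: the reader-level values below are hypothetical (the study does not publish per-radiologist data), and the first-order partial correlation formula is one standard way to control for a single covariate such as reading volume.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation coefficient (r)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rho: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


# Hypothetical values for 10 readers (NOT the study data).
recall_pct = [8.0, 9.0, 9.5, 10.5, 11.0, 12.0, 12.5, 14.0, 15.0, 17.0]
detect_per_1000 = [2.8, 3.4, 3.0, 3.3, 3.9, 3.6, 4.4, 4.1, 5.2, 4.9]
volume = [15000, 5200, 12400, 4000, 9800, 14100, 7300, 11050, 6400, 12700]

print("Pearson r   :", round(pearson(recall_pct, detect_per_1000), 3))
print("Spearman rho:", round(spearman(recall_pct, detect_per_1000), 3))
print("Partial r   :", round(partial_corr(recall_pct, detect_per_1000, volume), 3))
```

With only 10 readers, a single outlying reader can move r substantially, which is one reason for reporting the rank-based Spearman coefficient alongside it.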

RESULTS 

Recall  and  detection  rates  for  the  10  radiologists 
whose  data  were  analyzed  in  the  current  study  were 
computed.  Each  performed  >  3500  interpretations 
(range,  3605-16,128  interpretations)  during  the  pe¬ 
riod  in  question.  We  were  unable  to  publish  detailed 
information  for  individual  radiologists  without  pro¬ 
viding  individually  traceable  data  because  each  staff 
radiologist  is  aware  of  the  approximate  volume  of 
screening  examinations  they  interpreted  and  their 
approximate  recall  rate.  These  10  radiologists  inter¬ 
preted  a  total  of  98,668  cases  during  this  time  and 
detected  368  cases  of  carcinoma.  Twenty-six  “con¬ 
version”  cases  were  included  in  the  analysis.  These 
cases  originally  were  presented  as  a  screening  pro¬ 
cedure  but  the  patients  underwent  “follow-up”  pro¬ 
cedures  (e.g.,  ultrasound)  during  the  same  visit  (be¬ 
cause  of  a  physician  being  present  on  site  at  the 
time  of  the  visit).  A  wide  range  of  recall  rates  (range, 
7.7-17.2%) and detection rates (range, 2.6-5.4 per
1000  mammograms)  was  observed.  Despite  the  low 
number  of  radiologists  (10),  when  recall  and  detec¬ 
tion  rates  were  compared  using  the  parametric 
Pearson  (r)  correlation  coefficient,  the  correlation 
between  the  recall  and  detection  rates  was  signifi¬ 
cant  (r  =  0.76;  P  =  0.01).  Similarly,  a  significant 
correlation  was  observed  in  the  group  of  radiologists 
using the nonparametric Spearman correlation coefficient
(rho = 0.72; P = 0.02). A linear least-squares
fit between the recall and detection rates for the
group in which each radiologist represents a single
“operating point” is presented in Figure 1. Despite
significant interreader variability, the slope indicates
an average of 0.22 additional detections per
1% increase in recall rates (95% confidence interval
on the slope is +0.068 to +0.378). The correlation
between recall and detection rates remained significant
(P < 0.05) after accounting for the total volume
read by each radiologist using partial correlations.
Repeated analyses after the exclusion of the
26 “conversion” cases indicated no substantial difference
in the correlations reported herein. The correlations
remained significant when the analysis
was repeated for the 7 (P = 0.05), 8 (P < 0.05), and
9 (P < 0.05) highest volume radiologists. These results
demonstrate that, in general, in our practice,
the higher the recall rates, the higher the detection
rates. This increase in detection rate was found to
persist over the range of observed recall rates and
extended beyond the currently recommended practice
guideline of 10%.

FIGURE 1. A linear fit of detection rates as a function of recall rates for the
10 radiologists in the current study.
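
As a consistency check on the reported figures, the t statistic implied by the Pearson coefficient can be compared with the one implied by the slope and its confidence interval; in simple linear regression the two are identical apart from rounding. The sketch below assumes that the interval is a conventional t-based 95% interval with n - 2 = 8 degrees of freedom, which the text does not state explicitly, and the helper names are illustrative.

```python
import math

# Values reported in the text.
N_READERS = 10
R_PEARSON = 0.76
SLOPE = 0.22
CI_LOW, CI_HIGH = 0.068, 0.378

# Two-sided 95% critical value of Student's t with 8 degrees of freedom.
T_CRIT_DF8 = 2.306


def t_from_r(r, n):
    """t statistic for testing a Pearson correlation against zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


def se_from_ci(low, high, t_crit):
    """Standard error recovered from a symmetric t-based interval."""
    return (high - low) / (2 * t_crit)


t_r = t_from_r(R_PEARSON, N_READERS)
se_slope = se_from_ci(CI_LOW, CI_HIGH, T_CRIT_DF8)
t_slope = SLOPE / se_slope

print(f"t from r: {t_r:.2f}")
print(f"t from slope and CI: {t_slope:.2f}")
```

Both routes give a t statistic of about 3.3, so the reported correlation, slope, and confidence interval are mutually consistent under the stated assumption.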

DISCUSSION 

There  is  little  doubt  that  continuing  education  and 
training  are  important  factors  in  the  ability  of  radi¬ 
ologists  to  be  consistent  in  interpreting  mammo¬ 
grams  and  to  improve  their  overall  performance. 
However,  to  our  knowledge,  there  are  no  conclusive 
data  published  to  date  regarding  to  what  extent 
improvement  continues  beyond  a  certain  level  of 
training  or  experience.12  Although  there  are  ques¬ 
tions  with  regard  to  whether  volume  and  experience 
affect  performance,12  the  general  belief  has  been 
that  one  can  reduce  recall  rates  relatively  easily 
without  a  significant  impact  on  detection  rates.  As  a 
result,  there  is  an  ongoing  significant  effort  to  do  so, 
particularly  in  practices  similar  to  ours  with  recall 
rates that are in the higher range (≥ 10%). PPV as a
result  of  screening  has  been  of  great  interest  as  one 
of  the  indicators  of  the  performance  level  of  radiol¬ 
ogists  in  this  area.8  However,  if  sensitivity  is  affected 
by  recall  rates,  particularly  in  a  group  of  well 
trained,  high-volume  radiologists  whose  recall  rates 
are  relatively  high,  the  fundamental  question  of 
whether  to  continually  pressure  them  to  reduce 
their  recall  rates  following  currently  accepted  prac¬ 
tice  guidelines  remains.  This  stems  from  the  fact 
that  the  detection  of  “earlier  tumors”  with  higher 
recall  rates  may  be  as  or  perhaps  more  important 
than  actually  reducing  the  recall  rates  or  improving 
the  PPV  somewhat.  It  is  interesting  to  note  that  an 
important  review  of  several  related  issues  suggested 
observations  that  were  similar  to  those  of  the  cur¬ 
rent  study.10  Unfortunately,  to  our  knowledge  the 
radiology  community  has  not  objectively  addressed 
this  potentially  important  matter  to  date. 

Similar  to  the  findings  reported  by  Yankaskas  et 
al.8,  the  results  of  the  current  study  suggest  that  de¬ 
tection  rates  generally  are  affected  by  recall  rates  in 
the  lower  range.  However,  unlike  the  observations  of 
Yankaskas  et  al.,8  the  effect  in  our  group  of  10  highly 
trained  radiologists,  who  individually  read  a  reason¬ 
ably  high  volume  of  mammograms,  persisted  over  the 
entire  range  of  observed  recall  rates  (as  high  as  17%). 
In  the  higher  range  of  recall  rates  (>  7%),  Yankaskas  et 
al.8  showed  no  correlation  between  the  recall  and 
detection  rates.  Therefore,  their  results  could  suggest 
that,  in  this  critical  range,  a  reduction  in  recall  rates 
should  not  affect  the  detection  rates.  It  is  possible  that 
this  difference  arises  from  the  fact  that  the  current 
study took place in a “reasonably stable” screening
population in whom the majority of “prevalence” (or
“baseline”) carcinomas had been detected already.
Another  possible  explanation  may  be  the  number  of 
mammograms  interpreted  by  individual  radiologists 
in  the  two  studies.  Clearly,  more  data  are  needed  in 
this  regard. 

The  total  number  of  mammography  screening  in¬ 
terpretations  by  the  radiologist  with  the  lowest  screen¬ 
ing  volume  reported  herein  over  a  3-year  period  was 
relatively  low.  However,  our  regionwide  referral  base 
was  found  to  result  in  a  large  number  of  other  diag¬ 
nostic  and  interventional  breast-imaging  procedures 
that  typically  amount  to  approximately  50%  of  the 
screening  examinations.  Hence,  our  radiologists 
should  be  considered  as  “specialists”  in  breast  imag¬ 
ing. 

It  should  be  noted  that  in  our  practice  the  average 
recall rates (≈ 11%) are generally relatively high
compared with some reports,18,19 and they are in better
agreement with, and in some cases lower than,
others.15,20,21 We have no simple explanation for this
observation.  The  results  of  the  current  study  are  in 
agreement  with  the  findings  of  Beam  et  al.12  and  oth¬ 
ers  in  that  there  is  a  large  variability  in  the  perfor¬ 
mance  of  the  radiologists  in  this  area.  We  did  not 
detect  a  significant  correlation  between  the  volume 
read  by  the  individual  radiologists  during  the  period  in 
question  and  their  performance  level,  although  the 
radiologists  in  the  current  study  all  can  be  considered 
high  volume,  "well  trained"  readers  with  significant 
experience.  There  are  several  arguments  one  can  raise 
with  regard  to  why  the  estimated  recall  and  detection 
rates  in  the  current  study  may  not  be  precise  in  terms 
of  absolute  values.  These  include  but  are  not  limited 
to  the  inclusion  of  palpable  cases  and  incomplete 
follow-up  of  cancer  patients  who  may  be  lost  to  other 
institutions.  The  fact  that  our  primary  area  of  interest 
is  the  relative  performance  levels  of  the  radiologists 
(rather  than  absolute)  makes  the  results  valid  despite 
these  limitations,  as  long  as  one  does  not  bias  the 
interpretation  process  by  selectively  assigning  a  spe¬ 
cific  subset  to  be  interpreted  by  one  radiologist  or 
another  (e.g.,  all  “high  risk”  women  or  all  examina¬ 
tions  of  women  with  dense  breasts  are  assigned  to 
“conservative"  or  "high-volume”  radiologists).  This 
was  clearly  not  the  case  in  our  practice.  Therefore,  one 
would  expect  that  any  related  corrections  as  a  result  of 
these  limitations  would  be  largely  proportional  to  the 
volume  of  cases  interpreted  by  each  radiologist  in  the 
course of their routine clinical practice. The correlation
between detection rates and outcome or even
“average stage of disease” at the time of detection is
beyond  the  scope  of  this  project  because  the  number 
of  tumors  detected  by  an  individual  radiologist  was 
too  small  and  the  follow-up  time  after  detection  too 
short  to  meaningfully  assess  differences,  if  any,  in 
outcome. 

The  results  of  the  current  study  suggest  that  be¬ 
fore  we  unilaterally  pressure  radiologists  to  reduce 
their  recall  rates  because  of  a  notion  that  this  will 
improve  our  practices  (and  reduce  overall  manage¬ 
ment  costs),  we  need  to  carefully  evaluate  the  impact 
such  an  effort  may  have  on  early  (and  perhaps  even 
“earlier”)  detection.  If  we  believe  that  screening 
should  focus  primarily  on  maximizing  early  detection, 
and  the  earlier  the  better,  one  has  to  consider  whether 
there may be an individualized optimal operating level
that should be considered, rather than a “globally”
recommended practice guideline of a maximum “acceptable”
recall rate that applies to all screening mammographers.
This view may be supported by women
who  appear  to  strongly  prefer  a  small  increase  in  de¬ 
tection  rates,  even  at  the  expense  of  higher  recall  rates 
and  the  associated  impact  in  terms  of  cost  and  added 
anxiety.22-24 The current limited study included a
group of 10 academic radiologists practicing at 1 institution
under 1 set of practice conditions. Clearly,
more  data  are  required  before  one  can  generalize  the 
findings  reported  herein  to  the  population  of  radiolo¬ 
gists  who  interpret  screening  mammography  in  this 
country.  At  the  same  time,  the  number  and  type  of 
examinations  used  in  the  current  analysis  may  be 
generalizable  to  the  screening  population  in  a  large 
number  of  academic  practices  around  the  U.S. 

CONCLUSIONS

The  performance  level  of  a  radiologist  in  the  screening 
environment  is  a  complex,  multifactorial  issue  that 
cannot  and  should  not  be  simplified.  Reducing  recall 
rates by “decree” (through the enforcement of recommended
practice guidelines) may result in a corresponding
reduction in the detection rates and, hence, in
associated delays in detection. The impact of external pressure on
individual  radiologists  to  reduce  their  recall  rates 
should  be  evaluated  carefully. 

PURPOSE: To compare performance of two computer-aided detection (CAD)
systems  and  an  in-house  scheme  applied  to  five  groups  of  sequentially  acquired 
screening  mammograms. 

MATERIALS  AND  METHODS:  Two  hundred  nineteen  film-based  mammographic 
examinations,  classified  into  five  groups,  were  included  in  this  study.  Group  1 
included  58  examinations  in  which  verified  malignant  masses  were  detected  during 
screening;  group  2,  39  in  which  all  available  latest  examinations  were  performed 
prior  to  diagnosis  of  these  malignant  masses  (subset  of  39  women  from  group  1); 
group  3,  22  in  which  findings  were  interpreted  as  negative  but  were  verified  as 
cancer  within  1  year  from  the  negative  interpretation  (missed  cancers);  group  4,  50 
in  which  findings  were  negative  and  patients  were  not  recalled  for  additional 
procedures;  and  group  5,  50  in  which  patients  were  recalled  for  additional  proce¬ 
dures  and  findings  were  negative  for  cancer.  In  all  examinations,  images  were 
processed with two Food and Drug Administration-approved commercially available
CAD systems and an in-house scheme. Performance levels in terms of true-positive
detection rates and number of false-positive identifications per image and
per  examination  were  compared. 
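
The performance measures named above can be sketched as follows. This is an illustration under stated assumptions rather than the scoring procedure used in the study: the `Exam` record, the case-based criterion (a mass counts as detected if it is cued on at least one view), and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Exam:
    has_mass: bool     # verified malignant mass present
    mass_cued: bool    # CAD cue overlaps the mass on at least one view
    false_marks: list  # false-positive cues per image, one entry per view


def detection_rate(exams):
    """Fraction of positive examinations in which the mass was cued."""
    positives = [e for e in exams if e.has_mass]
    return sum(e.mass_cued for e in positives) / len(positives)


def false_positive_rates(exams):
    """Return (false-positive cues per image, per examination)."""
    total_marks = sum(sum(e.false_marks) for e in exams)
    total_images = sum(len(e.false_marks) for e in exams)
    return total_marks / total_images, total_marks / len(exams)


# Hypothetical four-view examinations (NOT the study data).
exams = [
    Exam(True, True, [0, 1, 0, 0]),
    Exam(True, False, [0, 0, 0, 0]),
    Exam(True, True, [1, 0, 0, 1]),
    Exam(False, False, [0, 0, 1, 0]),
    Exam(False, False, [1, 0, 0, 0]),
]

negatives = [e for e in exams if not e.has_mass]
sensitivity = detection_rate(exams)
fp_per_image, fp_per_exam = false_positive_rates(negatives)
print(f"detection rate: {sensitivity:.2f}")
print(f"false positives per image: {fp_per_image:.2f}")
print(f"false positives per examination: {fp_per_exam:.2f}")
```

Reporting false-positive cues on negative examinations only, as in this sketch, keeps the measure separate from cues that fall elsewhere on examinations containing a cancer.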

RESULTS:  Mass  detection  rates  in  positive  examinations  (group  1)  were  67%-72%. 
Detection  rates  among  three  systems  were  not  significantly  different  (P  >  .05).  In  50 
negative screening examinations (group 4), false-positive rates ranged from 1.08 to
1.68 per four-view examination. Performance level differences among systems were
significant  for  false-positive  rates  (P  =  .008).  Performance  of  all  systems  was  at  levels 
lower  than  publicly  suggested  in  some  retrospective  studies.  False-positive  CAD 
cueing  rates  were  significantly  higher  for  negative  examinations  in  which  patients 
were  recalled  (group  5)  than  they  were  for  those  in  which  patients  were  not  recalled 
(group 4) (P ≤ .002).

CONCLUSION:  Performance  of  CAD  systems  for  mass  detection  at  mammography 
varies  significantly,  depending  on  examination  and  system  used.  Actual  perfor¬ 
mance  of  all  systems  in  clinical  environment  can  be  improved. 
© RSNA, 2004

An increasing body of evidence suggests that early detection of breast cancer through
periodic  screening  is  beneficial  (1,2).  Mammographic  screening  is  rapidly  gaining  accep¬ 
tance  worldwide,  and  the  number  of  procedures  performed  continues  to  increase  (3,4).  The 
difficulty  in  identification  of  some  subtle  suspicious  regions  depicted  on  mammograms, 
particularly  those  related  to  masses  and  asymmetric  densities,  the  repetitious  and  some¬ 
times  tedious  nature  of  the  task,  and  the  shortage  of  experienced  radiologists  who  spe¬ 
cialize  in  breast  imaging  and  who  routinely  read  high  volumes  of  images  in  examinations 
have  resulted  in  a  wide  variability  in  observer  performance  levels,  as  well  as  in  relatively 
high  recall  rates  for  additional  procedures  (5-7).  The  effectiveness  of  mammographic 
screening  programs  depends  on  many  factors.  These  factors  include,  but  are  not  limited  to, 
the expertise and judgmental ability of
the  radiologist  who  reads  the  mammo¬ 
gram.  Variability  among  radiologists  can 
actually  be  useful  insofar  as  studies  show 
that  double  reading,  namely,  having  two 
radiologists  read  the  same  mammogram 
independently,  could  increase  detection 
by  as  much  as  15%  (8).  Even  if  there  were 
no  shortage  of  experienced  radiologists, 
however,  the  cost  of  true  double  reading 
as  a  standard  practice  is  prohibitive  for 
most  facilities. 

In  recent  years,  major  efforts  have  been 
expended  to  develop  computer-aided  de¬ 
tection  (CAD)  systems  that  will  help  ra¬ 
diologists  with  breast  cancer  detection. 
The  hope  is  that  these  systems  will  serve 
as  a  second  reader  and  will  help  improve 
sensitivity  without  a  substantial  increase 
in  recall  rates  and  at  the  same  time  pos¬ 
sibly  decrease  reader  variability,  as  well. 
These  systems  are  currently  aimed  at 
the  early  detection  of  cancer  and  are 
accordingly  designed  to  assist  the  radi¬ 
ologist  in  detection  of  suspicious  re¬ 
gions  depicted  as  clustered  microcalci¬ 
fications  and  masses  (9-11).  Computer- 
aided  diagnosis  systems  are  also  being 
developed  to  assist  radiologists  in  the 
classification  task,  namely,  the  determi¬ 
nation  of  whether  or  not  an  identified 
finding  is  likely  to  represent  a  malig¬ 
nancy  (11-13).  The  Food  and  Drug  Ad¬ 
ministration  has  approved  several  detec¬ 
tion  systems  for  routine  clinical  use,  and 
Medicare  and  other  insurance  companies 
have  approved  reimbursement  for  their 
use  in  clinical  practice. 

Results of studies (14,15) suggest that the use of CAD systems could potentially increase
cancer detection rates by as much as 20% without a significant increase in recall rates. To
date, there are limited data on the actual effect of the prospective use of such systems in
the clinical environment (16,17). There is some evidence that the performance of
radiologists, at least in the laboratory setting, is affected by the performance of the CAD
scheme itself (18). Hence, a high level of performance is an important factor in the ultimate
clinical success of CAD.

Data for comparison of the performance of CAD systems applied to the same set of cases are
limited (19-22). The purpose of our study, therefore, was to compare the performance of two
FDA-approved commercially available CAD systems and an in-house-developed scheme in five
groups of sequentially acquired screening mammograms.


MATERIALS  AND  METHODS 
Screening  Examination  Groups 

Screening mammographic examinations performed at Magee-Womens Hospital, University of
Pittsburgh Medical Center, and at its five satellite breast-imaging clinics during 2002 were
included in this study. These examinations were classified into five groups. This study was
conducted with an institutional review board-approved protocol. Informed consent was waived.
Images in all examinations included in this study were acquired with film (MIN-R-2000;
Eastman Kodak, Rochester, NY) and were clinically interpreted with CAD (ImageChecker; R2
Technologies, Sunnyvale, Calif) as a part of our routine practice.

The data sources for the selection of examinations were databases of procedure scheduling,
procedure completion, radiology reporting, and procedure-related outcomes as determined from
relevant pathology reports.

Group  1  included  58  examinations 
performed  in  women  with  biopsy-proved 
cancer  that  initially  had  been  identified 
as  a  mass  by  a  radiologist  in  our  group 
during  a  screening  examination  in  2002. 
Images  were  selected  sequentially  from 
our  procedure-related  outcome  database 
by  a  staff  member  (J.S.S.)  who  did  not 
have  any  prior  knowledge  of  the  specific 
details  about  the  patient  or  of  the  visual 
characteristics  of  the  depicted  mass. 

In addition, there was an interest in the performance of CAD systems applied to examinations
performed 1 year prior to observation of a positive finding. Group 2 hence included the
latest available negative prior examinations of 39 women from group 1 (the same women, but a
different, earlier examination for each). These examinations were performed during or prior
to 2001, before the screening examination that led to a finding positive for cancer.

Group 3 included 22 consecutive false-negative examinations in which images depicted masses
in retrospect. In 21 examinations, one mass in each was depicted on images, and in one
examination two masses were depicted, which produced a total of 23 masses. These examinations
were defined in our practice as false-negative interpretations: they had been interpreted as
negative or benign (Breast Imaging Reporting and Data System category 1 or 2) during the
screening examination and were biopsy proved as positive for cancer, with a mass depicted on
subsequent mammograms obtained within 1 year of the negative examination. These examinations
constitute a different set of cases and are not a subset of the 39 prior examinations
described previously as group 2.

Group 4 included 50 verified negative examinations (Breast Imaging Reporting and Data System
category 1 or 2) that were selected randomly by the same staff member who selected those in
group 1 from the examinations performed during two preselected dates in 2002 (March 1 and 2,
2002). Findings in all of these examinations were verified with findings at a 1-year
follow-up screening examination that were interpreted as negative. A 1-year follow-up
examination was the latest available examination in these women.

Group 5 included 50 consecutive examinations in which patients had been recalled during April
2002 (Breast Imaging Reporting and Data System category 0). Results of the diagnostic work-up
that followed were negative or benign (Breast Imaging Reporting and Data System category 1 or
2), and results of the work-up for the annual examination in 2003 were negative, as well.

As  a  result,  a  total  of  219  examinations 
in  180  women  were  included  in  the 
study.  The  median  age  of  the  women 
whose  examinations  were  used  in  this 
study  was  54.5  years,  with  a  range  of 
38-87  years. 
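
The totals above can be checked with a short sketch. It assumes, as the group descriptions
suggest but do not state outright, that groups 1, 3, 4, and 5 each contribute one examination
per woman with no woman appearing in more than one of them, and that group 2 adds
examinations but no new women because it is drawn from group 1. The variable names are
illustrative only.

```python
# Examination counts per group, as stated in the group descriptions.
examinations = {1: 58, 2: 39, 3: 22, 4: 50, 5: 50}

# Assumption: group 2 holds prior examinations of women already counted in
# group 1, so only groups 1, 3, 4, and 5 contribute distinct women.
groups_adding_women = [1, 3, 4, 5]

total_examinations = sum(examinations.values())
total_women = sum(examinations[g] for g in groups_adding_women)

print(total_examinations, total_women)
```

Under these assumptions the sums agree with the 219 examinations and 180 women reported.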

Evaluation  of  Masses 

All examinations were reviewed by several investigators (D.G., J.H.S., L.A.H., J.S.S.)
together with source documents to generate a truth file that included depicted findings for
the examinations in question. The boundaries of the masses were drawn subjectively and
conservatively (approximately 5 mm larger than the depicted masses in all directions) on the
image obtained at the examination performed in 2002 that resulted in the finding and on the
corresponding areas on the images obtained at the prior examinations, when applicable. If
masses were depicted with spiculations, these were included in the mass region. Hence, the
allowed target was larger in all directions than was the depicted mass. This selection for
the increased size of the target was arbitrary and increased the marked regions, in some
cases substantially, beyond the mass contours, with the expectation that any identification
(detection) by the CAD system close to the actual mass would not be disregarded by the
interpreting radiologists. It also allowed position changes at the prior examination to be
more conservatively accounted for because of the larger allowed target for detection.

TABLE 1
Fraction of Detected Masses according to Breast and Image for Biopsy-proved Cancers

Detection Fraction according to Breast

Group and                   No. of Breasts with   Second Look   ImageChecker   In-house Scheme
Examination Type            Malignant Masses      (%)           (%)            (%)
1, True-positive            58                    72 (42)       71 (41)        67 (39)
2, Prior to true-positive   39                    23 (9)        26 (10)        15 (6)
3, False-negative           23                    35 (8)        39 (9)         30 (7)

Detection Fraction according to Image

Group and                   No. of Visible        Second Look   ImageChecker   In-house Scheme
Examination Type            Malignant Masses      (%)           (%)            (%)
1, True-positive            114*                  56 (64)       55 (63)        [not legible]
2, Prior to true-positive   78†                   14 (11)       14 (11)        [not legible]
3, False-negative           45‡                   27 (12)       27 (12)        [not legible]

Note: Only the breast in which cancer was found was included in the calculations. Numbers in
parentheses were used to calculate the percentages.
* One patient had two malignant masses in the same breast. In four examinations, the
malignant mass was visible on only one mammographic view.
† One patient had two malignant masses in the same breast. In two examinations, the malignant
mass was visible on only one mammographic view during the later examination.
‡ One patient had one malignant mass in both breasts. In one examination, the malignant mass
was visible on only one mammographic view during the later examination.

For each examination, processing was performed with three CAD systems. One system
(ImageChecker M1000, version 3.1; R2 Technologies) was used routinely in our clinical
practice and was the system with which processing had been performed in all of the
examinations during the original clinical interpretation. Another system (Second Look,
version 6.0 Beta; CADx Systems, Beavercreek, Ohio) was used to process all images as well. A
third system was an in-house-developed scheme, and its use has been reported in the past
(23-25).

To ensure that there was no bias in the results, with the exception of the fact that the
initial selection may have been affected somewhat by the use of the system that we used
during the initial clinical interpretation, we fixed the detection threshold for
determination of suspicious regions on the in-house system. This was done to provide a binary
output in our own scheme (identified regions were either marked or not marked), which was
similar to that of the commercial systems, rather than a continuous output (0-1). Hence, we
provided an automated operation (no operator decisions or options) to an experienced staff
member (J.S.S.) who had processed images in several thousands of examinations with both
commercial systems during the past 3 years and who processed all the images used in this
study with all three systems.

The digitized images (model 861; Howtek, Hudson, NH) obtained with the Second Look system
were then transferred to the in-house scheme and processed in exactly the same manner. A
true-positive finding detected by the CAD system was attributed to each mark (cued region)
noted by the CAD system if the center of the marked region fell within the boundary of the
conservatively drawn contour of the recorded mass area in the manually drawn truth file.
Otherwise, the CAD system markings were considered false-positive findings. This task was
performed by one staff person (the same experienced staff person mentioned previously) to
avoid interoperator biases. Biases, if any, were assumed to be consistent for all three
systems, and this assumption enabled a relative comparison among them, even if there were
some biases in absolute terms.
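
The scoring rule described above (a mark counts as a true-positive finding when its center
lies inside the drawn target contour, and as a false-positive finding otherwise) can be
sketched as follows. This is a minimal illustration, not the procedure used in the study,
where scoring was done by a staff member. It assumes that contours are stored as polygons of
(x, y) vertices and that each mark is reduced to its center point; the function names and the
sample coordinates are hypothetical.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: True if point lies inside polygon (a list of (x, y) vertices)."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Consider only edges that straddle the horizontal line through the point.
        if (y1 > y) != (y2 > y):
            # x-coordinate at which the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def score_marks(mark_centers, truth_contours):
    """Classify each CAD mark by whether its center falls inside any truth contour."""
    mass_hit = [False] * len(truth_contours)
    true_positive_marks = 0
    false_positive_marks = 0
    for center in mark_centers:
        matched = False
        for i, contour in enumerate(truth_contours):
            if point_in_polygon(center, contour):
                mass_hit[i] = True
                matched = True
        if matched:
            true_positive_marks += 1
        else:
            false_positive_marks += 1
    return {
        "true_positive_marks": true_positive_marks,
        "false_positive_marks": false_positive_marks,
        "masses_detected": sum(mass_hit),
    }


# Hypothetical image: one square target contour and two CAD marks.
contours = [[(10, 10), (30, 10), (30, 30), (10, 30)]]
marks = [(20, 20), (50, 50)]
result = score_marks(marks, contours)
print(result)
```

Because the target contours were drawn larger than the depicted masses, a rule of this kind
credits marks that land close to, but not exactly on, the mass.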

Statistical  Analysis 

True and false findings were tabulated for all examinations. Both breast-based (on either of
the mammographic views) and image-based (each image considered as an independent examination)
detections were recorded, and detection rates per breast and per image, as well as
false-positive rates per examination (all four mammographic views), were computed. The three
systems were compared for detection levels (sensitivity) by using a repeated-measures
binary-response model in which there were three replicates, one for each patient according to
each of the three modalities. The average of false-positive cues provided among the three
systems was compared by using Friedman two-way analysis of variance. The number of
false-positive findings that were detected in negative screening examinations and those in
examinations for which patients were recalled were compared by using the Mann-Whitney U test.
All analyses were performed with software (SAS, version 8.2; SAS Institute, Cary, NC). For
each modality, the difference in false-positive rates between negative screening examinations
and those in which patients were recalled was compared, assuming independent Poisson
distributions. All statistical tests were two sided, and a difference with P < .05 was
considered significant.
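
One common way to compare two false-positive rates under independent Poisson assumptions is a
conditional exact test: given the combined number of cues, and with equal numbers of
examinations in both groups, the count in one group follows a binomial distribution with
probability 0.5 under the null hypothesis. The sketch below illustrates that idea only. The
text does not specify which Poisson-based procedure was run in SAS, so this should not be
read as the authors' method, and its P value need not match the reported ones. The function
name is hypothetical; the counts are the Second Look cue totals for groups 4 and 5 (50
examinations each) from Table 2.

```python
from math import comb


def equal_exposure_poisson_p(count_a: int, count_b: int) -> float:
    """Two-sided conditional exact test for equal Poisson rates.

    Assumes both counts were observed over the same exposure (here, the same
    number of examinations), so that under the null hypothesis the smaller
    count is Binomial(count_a + count_b, 0.5).
    """
    total = count_a + count_b
    if total == 0:
        return 1.0
    smaller = min(count_a, count_b)
    # Probability of a count at least as small as the one observed.
    lower_tail = sum(comb(total, i) for i in range(smaller + 1)) / 2 ** total
    # The null distribution is symmetric, so double the tail and cap at 1.
    return min(1.0, 2.0 * lower_tail)


# False-positive cues from Second Look: 84 in group 4 and 135 in group 5.
p_second_look = equal_exposure_poisson_p(84, 135)
print(f"two-sided p = {p_second_look:.4g}")
```

The equal-exposure shortcut applies here only because groups 4 and 5 both contain 50
examinations; unequal group sizes would require a binomial probability other than 0.5.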

RESULTS 


TABLE 2
False-Positive Cueing Rate per Patient, per Region, and per Image for All Examinations

False-Positive Rate according to Patient

Group and                               Total No. of
Examination Type                        Examinations   Second Look   ImageChecker   In-house Scheme
1, True-positive                        58             1.53 (89)     1.05 (61)      1.05 (61)
2, Prior to true-positive               39             1.64 (64)     1.13 (44)      1.33 (52)
3, False-negative                       22             1.18 (26)     1.00 (22)      1.50 (33)
4, Screening mammography                50             1.68 (84)     1.08 (54)      1.20 (60)
5, Mammography with recalled patients   50             2.70 (135)    2.16 (108)     2.86 (143)

False-Positive Rate according to Region*

Group and                               Total No. of
Examination Type                        Examinations   Second Look   ImageChecker   In-house Scheme
1, True-positive                        58             1.28 (74)     0.88 (51)      0.98 (57)
2, Prior to true-positive               39             1.41 (55)     1.00 (39)      1.10 (43)
3, False-negative                       22             0.86 (19)     0.82 (18)      1.36 (30)
4, Screening mammography                50             1.40 (70)     0.96 (48)      1.06 (53)
5, Mammography with recalled patients   50             2.28 (114)    1.82 (91)      2.68 (134)

False-Positive Rate according to Image

Group and                               Total No. of
Examination Type                        Images         Second Look   ImageChecker   In-house Scheme
1, True-positive                        228            0.39 (89)     0.27 (61)      0.27 (61)
2, Prior to true-positive               156            0.41 (64)     0.28 (44)      0.33 (52)
3, False-negative                       84             0.31 (26)     0.26 (22)      0.39 (33)
4, Screening mammography                200            0.42 (84)     0.27 (54)      0.30 (60)
5, Mammography with recalled patients   200            0.68 (135)    0.54 (108)     0.72 (143)

Note: All positive and negative images obtained in each patient were used to calculate the
false-positive rate. Numbers in parentheses were used to calculate the false-positive rates.
False-positive rates according to patient and region were based on total number of
examinations.
* If a region was cued on two mammographic views, it was counted as one marked false-positive
region.

Table 1 summarizes the findings in groups 1-3 according to breast and image. Table 1 also
demonstrates that the detection rates of the three systems in detection of true-positive
masses in group 1 (58 breasts with malignant masses) were 72% (42), 71% (41), and 67% (39)
for Second Look, ImageChecker, and the in-house scheme, respectively. Table 1 further
demonstrates the results of processing the latest prior examinations of the women with
subsequent positive findings (group 2). These were acquired between 1 year and 2 years 4
months prior to the subsequent positive examinations, with an average time difference of 1
year 4 months. As expected, detection rates were substantially lower in the same patients
when images obtained in the latest prior examinations were processed. Although in 39 breasts
with malignant masses, 23% (nine), 26% (10), and 15% (six) of masses were detected
retrospectively on images obtained at prior examinations with CAD, the images in the
examinations were read as not suspicious enough to result in a recall of the patient during
the original clinical interpretations. CAD detection rates in the false-negative group (group
3) with 23 breasts with malignant masses were 35% (eight), 39% (nine), and 30% (seven) for
the three systems, respectively.
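
The breast-based detection fractions quoted in this paragraph follow directly from the counts
given in parentheses and the number of breasts with malignant masses in each group. The short
sketch below recomputes them; the dictionary layout and names are illustrative only, and the
counts are those stated in the text.

```python
# Breasts with malignant masses per group, and masses detected by each system,
# in the order Second Look, ImageChecker, in-house scheme (from the text).
breasts = {1: 58, 2: 39, 3: 23}
detected = {
    1: {"Second Look": 42, "ImageChecker": 41, "In-house": 39},
    2: {"Second Look": 9, "ImageChecker": 10, "In-house": 6},
    3: {"Second Look": 8, "ImageChecker": 9, "In-house": 7},
}

# Detection fraction per breast, as a whole-number percentage.
detection_percent = {
    group: {system: round(100 * count / breasts[group]) for system, count in counts.items()}
    for group, counts in detected.items()
}

for group, row in detection_percent.items():
    print(group, row)
```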

Table 2 shows that the false-positive rates in negative examinations (group 4) were 1.68,
1.08, and 1.20 per examination (four images) for the same three systems, respectively. For
the examinations in which patients were recalled but findings were later verified as negative
(group 5), the false-positive rates were 2.70, 2.16, and 2.86, respectively (Table 2). In
Table 2, we also provide the number of cued but negative regions, which is less than the
total number of false-positive cues, after adjustment for regions that were cued on both
mammographic views and after counting of these matched cues as one false-positive region.
When we compared the performance of the three systems, the differences were not significant
(P = .63) for detection in actually positive examinations that led to the detection of
cancer, and the differences in false-positive rates were significant (P = .008) for the
number of false-positive identifications in the negative screening examinations. The
differences in average false-positive rates between negative screening examinations and
examinations in which patients had been recalled were significant (P = .002, P < .001, P <
.001 for the three systems).
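
The per-examination false-positive rates for groups 4 and 5 are the cue counts from Table 2
divided by the 50 examinations in each group. The sketch below recomputes them and adds the
ratio of the group 5 rate to the group 4 rate for each system. That ratio is an illustrative
summary introduced here, not a quantity reported in the study, and the names are
hypothetical.

```python
# False-positive cue counts from Table 2 (50 examinations in each group).
examinations_per_group = 50
false_positive_cues = {
    4: {"Second Look": 84, "ImageChecker": 54, "In-house": 60},
    5: {"Second Look": 135, "ImageChecker": 108, "In-house": 143},
}

# False-positive cues per examination (four images), rounded to two decimals.
rate_per_examination = {
    group: {system: round(count / examinations_per_group, 2) for system, count in counts.items()}
    for group, counts in false_positive_cues.items()
}

# Illustrative summary: how much higher the rate is in recalled (group 5)
# than in negative screening (group 4) examinations.
recall_to_screening_ratio = {
    system: false_positive_cues[5][system] / false_positive_cues[4][system]
    for system in false_positive_cues[4]
}

print(rate_per_examination)
print({system: round(ratio, 2) for system, ratio in recall_to_screening_ratio.items()})
```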

DISCUSSION 


Several studies about the performance levels of each of the systems in question have been
published (14,16,26,27), but the differences in patient selection and study design make any
direct comparison difficult. The comparison of basic performance levels can only be performed
appropriately when the systems are tested on the same set of examinations with a sample that
is large enough and, preferably, contains as representative a sample as possible (eg,
sequentially ascertained examinations) so that results can be generalized to the screening
population of women.

It is not reasonable to expect, especially on a worldwide level, that film images will
quickly be totally replaced by digital images. For that reason, most CAD systems currently in
use must provide a method to digitize images, and this process in combination with
differences in CAD algorithms may lead to problems in regard to standardization and
reproducibility of results even when applied to a single system (28-30). There is little
doubt that differences in performance among CAD systems will remain. If we want to collect
data that allow radiologists to improve the practice of screening mammography, it is
important that we understand the possible effects that may result from using different CAD
systems. There are few data about a comparison of performance of different CAD systems when
applied to the same sets of examinations. As systems continue to evolve and improve, the
results of such comparisons are valid only for the experimental conditions being implemented
with the specific systems (eg, digitizer and software versions) that were studied. For that
reason, the results of such studies, while interesting and possibly suggestive of the effects
of system differences at a given time and for a specific distribution of examinations, may be
obsolete within a short period. It is important, however, to recognize that there are
differences (frequently substantial) in the performance levels of different CAD systems. If
such differences affect radiologists during clinical interpretations of findings of screening
mammographic examinations, one should be aware of them (18,31).

Lechner et al (19) compared two Food and Drug Administration-approved CAD devices,
ImageChecker M1000 (R2 Technologies) and Second Look. They found that 90% and 89% of
abnormalities associated with cancers in 120 examinations were detected by the ImageChecker
and Second Look systems, respectively. While 100% and 90% of the ten examinations with both
masses and microcalcification clusters were detected with the two systems, respectively, only
84% and 82% of the 67 masses without clusters were identified with the two systems. Similar
performance levels were reported in other studies (26), albeit no comparisons with other
systems were made. A review of the findings from these studies, as well as of the Food and
Drug Administration approval process, suggests the following: The performance of the two
commercial systems is reasonably comparable for all practical purposes. If differences exist,
they are small and would require large sample sizes to quantify them (32).

Our study is somewhat different in the examination selection process. We attempted to select
a sequentially acquired, and potentially representative, sample of each type of examination
to allow generalizability, at least to our own screening population. Recently, investigators
in a study (20) reported that the patient-based sensitivity for detection of "actionable
architectural distortion" with these two systems when applied to 45 examinations (in 43
patients) was less than 50% for either system. In another study of retrospectively reviewed
prior examinations with findings that suggested "evidence of cancer on prior mammograms,"
approximately 50% sensitivity for mass detection (eight of 19 with Second Look and 12 of 19
with ImageChecker) on prior images was indicated (22).

Although our study is similar to that of Shile and Guingrich (22) in that we attempted to
select a representative population of examinations, it differs in several respects. First, we
included a series of all available sequentially acquired sets of examinations.

Second, our false-positive rate was computed from a set of negative examinations rather than
from the same examinations in which a mass was found.

Third, our assessment of CAD performance in the five sets of examinations allows one to have
a better perspective of the possible effect of CAD on clinical practices with each type of
mammogram. In our study, performance of all systems was at somewhat lower levels than
expected. This could be the result of several factors. These factors included, but were not
limited to, the difficulty of detection of the "average" cancer with our screening program.
The conservatively defined mass regions (targets) reduced the possibility of biases that
would result from exact marking. The use of only one experienced person, who was not involved
in our CAD development team, to rate the correct markings ensured consistency in the scoring.
This should have decreased, if not completely eliminated, any biases in the relative
comparison among the systems. At this level of performance, we showed that experienced
radiologists do not substantially improve their mass detection performance levels in the
laboratory (18), and we suspect that this might be the case in the clinical environment, as
well (17). Interestingly, the false-positive rates for examinations in which patients were
recalled but that later proved to be negative examinations (group 5) were higher than were
the rates for negative screening examinations. This finding suggests that these mammograms
are more difficult for the CAD, as well as for the human observer, to analyze correctly. The
performance of all three CAD systems was not very high in either the examinations with
false-negative interpretations or the prior examinations of women with subsequent actually
positive interpretations. This finding suggests that, at least in our environment, the
potential improvements in earlier detection of masses with the use of current CAD systems are
perhaps somewhat limited. Although seemingly unimportant as long as detection rates are
comparable, the false-positive rates may affect general radiologists' reliance on the CAD
results. High false-positive rates may result in low reader confidence in the CAD marking,
since many cues have to be reviewed and discarded as negative findings (18).

In addition, there are some indications that performance in the noncued areas may be affected
by the false-positive rate, as well (18). Because of the substantial difference in
medicolegal liability between false-negative and false-positive interpretations, the effect
of the CAD-generated false-positive cueing rate on noncued cancers may be an important issue
to consider.

As to the lower performance of our own in-house scheme for CAD, we note that the scheme was
originally designed and optimized for images digitized with a different digitizer (18,25),
which has substantially different signal and noise characteristics. Also, our current scheme
does not limit the total number of regions identified as suspicious per examination, as do
other systems (33). Despite these limitations, it performed reasonably well in a direct
comparison with two commercial systems.

Our study had several limitations. First, as previously indicated, our selection protocol may
have been somewhat biased in favor of the ImageChecker system in that the images obtained in
these examinations (with the exception of group 2) had been processed with this system during
the initial clinical interpretation, and this bias possibly influenced the results of these
examinations. However, our experience to date indicates that in our practice the changes were
minor at best, particularly with respect to the detection of masses (17).

Second, the verification of negative examinations was based on findings at the subsequent
screening examination. Although not optimal, this was the most recent available examination
at the time, and we assumed that errors in this regard, if any, were not likely to affect the
relative performance comparisons we described.

Third, the study was limited to the mammograms acquired at one institution and the masses
detected by one group of radiologists. However, we do not believe that this limitation
affected the results in a manner that would substantially affect similar comparisons at other
institutions.

Fourth, our conservative approach to generation of the targets (ie, drawing the mass regions)
may have affected the results. However, we verified that this effect was not substantial (<5%
in this set of cases) and did not affect the comparison of relative performance levels of the
three CAD systems.

Fifth, it could be argued that one of the limitations of the study was that we tested
complete systems and not the software scheme alone. Hence, the comparison could have been
affected by the digitizers in the two commercial systems we used. The fact is that a
commercial CAD system is integrated, and these systems were tested largely as they would be
used in a clinical environment. In this study, we cannot comment on a comparison that would
be based on testing of the software alone.

Last, our study focused on the detection of masses. The significantly higher performance of
CAD systems in the detection of microcalcifications may be sufficient to warrant the routine
use of these systems alone. Other nondetection issues, such as the assessment of possible
efficiency improvements in the reading of mammograms because of the high performance in the
detection of microcalcifications, were clearly beyond the scope of this study.

In summary, we observed somewhat lower than expected case-based and image-based detection
rates with CAD for all three systems. This is not to indicate that CAD cannot help the
radiologist, even at these levels of performance, in different clinical environments,
particularly radiologists with less experience in the interpretation of screening
mammograms. However, the level of improvement is not likely to be what had been estimated
from retrospective studies in a laboratory environment. Results of this study clearly
indicate that marked improvements in CAD performance levels for mass detection are both
desired and possible, and continuing efforts should be expanded in this area.
APPENDIX 15


The Effect of Routine Use of a Computer-Aided Detection System on the Practice of Breast
Imagers: A Subjective Assessment1


The optimal use of any technological tool, such as computer-aided detection (CAD), requires
the user to both understand the strengths and limitations of the technology and feel at ease
in adapting it into his or her practice. Much previous and ongoing effort has been directed
to the study of the impact of CAD systems, in terms of device performance (eg, digitizer,
detection algorithms) and clinical impact (eg, detection of cancers, recall rates) after
implementation (1-5). However, to date, there is no published information about the
perception of breast imagers with substantial experience in using mammography CAD with
regard to what impact it had on their own practice. This is an important issue because breast
imagers could ignore the CAD results altogether if they felt uncomfortable with the cueing
results. Reimbursement for CAD would then in effect become an unnecessary expense. In
addition, there may be an increase in liability for breast imagers in cases where CAD cues
are actually correct but cancers were missed by the radiologist.

Current mammography CAD schemes provide false-positive cues in the majority of cases, and
therefore a primary concern expressed about the widespread incorporation of CAD into
screening mammography practices is the potential for “over-reading,” namely, recalling too
many women for additional breast imaging procedures (6,7). If breast imagers themselves
believe that they are recalling too many women or taking too much time to interpret
examinations to rule out the false-positive cues, they may ignore the CAD results altogether.
To evaluate breast imagers’ perceptions of differences in their practice with respect to
recall rate, or the time required to interpret a mammogram when CAD is used, we surveyed 12
highly experienced breast imagers who had at least 2 years of practice at Magee-Womens
Hospital of the University of Pittsburgh Medical Center (Pittsburgh, PA).

In June of 2001, our facility began using a commercially available CAD system (Image Checker;
R2 Technology, Sunnyvale, CA) for most screening and diagnostic mammograms acquired in our
facilities. The screen films are digitized and analyzed by the CAD system and examinations
are interpreted using an alternator that displays the CAD cues on a monitor placed below the
displayed films. A large fraction of our diagnostic examinations performed at the hospital
breast center are performed using a Full Field Digital Mammography System (Senographe 2000;
GE Medical Systems, Waukesha, WI). This system uses an algorithm provided by R2, and CAD cues
are displayed on the system’s dedicated workstation.

To assess how breast imagers subjectively perceived
the effect of CAD on their own practice, if any at all, we
administered a voluntary short survey. This survey was
performed approximately 1 year after the introduction of
CAD into our screening practice. The survey questions
asked the breast imagers about any perceived changes in
their practice in the last 3 years. It must be noted
that this survey was administered before any of the participants
became aware of the results we obtained assessing
the actual impact of CAD on recall and detection rates in
our practice, which showed only minimal changes in recall
rates (3). We wish to emphasize that participants
were told the survey was being conducted primarily to
assess their subjective feelings about changes in their
practice as a result of the use of CAD.

Table 1

Distribution of Answers to Each of the Survey Questions

Questions of how Practice has Changed            Did not              Direction of Change
in the last 3 Years                              Change    Changed    Increased   Decreased

1. My general practice                              6         6
2. My recall rates during interpretations of
   screening mammograms                             7         5           4           1
3. My recommendations for biopsy as a fraction
   of diagnostic interpretation                    12         0           0           0
4. My recommendations for the use of US as a
   fraction of the total number of diagnostic
   procedures                                      11         1           1           0
5. The time spent reading each screening
   examination has                                  3         9           4           5
6. The time spent reading each diagnostic
   examination has                                  8         4           3           1
7. The use of CAD affected the way I practice       3         9

The distribution of the answers to each question in our
survey is shown in the Table. Although only six of the
respondents (50%) indicated that their practice had
changed in the last 3 years (question 1), nine of 12 answered
positively to the question whether or not CAD
had changed the way they practiced (question 7). These
nine radiologists perceived a number of CAD-related
changes. Five perceived a change in reading time; one in
reading time and recall rate; one in recall rate only; one
in reading time, recall rate, and rate of recommendation
for breast ultrasound; and one did not mark any specific
changes listed in the survey (questions 2-6). Of the three
radiologists who perceived that there was no change in
their practice because of CAD, two indicated general
changes in their practice. One indicated a decrease in recall
rate and an increase in screening reading time, the
other indicated an increase in recall rate and a decrease in
interpretation time. Therefore, only one respondent indicated
no specific change in response to all questions in
the survey.


Five respondents perceived that their screening recall
rate had changed (four thought that it had increased, one
indicated that it had decreased), and one reader perceived
his/her rate of recommending breast ultrasound had increased.
None of the 12 readers perceived any change in
their rate of recommending biopsy as a result of using
CAD during the interpretation of diagnostic procedures.
Unsolicited comments from four readers suggested
changes were considered to be temporary, a “learning
effect.” There was one unsolicited written comment questioning
the overall usefulness of CAD in a diagnostic setting.

This survey was not intended as a proof of actual
changes in the practice of breast imagers, if any, as a result
of incorporation of CAD into the diagnostic process.
Rather it was designed to assess their perceptions in this
regard. The fact that all but one reader recorded some
kind of practice change suggests that the CAD results are
not simply being ignored. The wide distribution of the
answers, indicating fairly equally perceived increases and
decreases in interpretation times and recall rates, suggests
that they adopted the practice without any significant difficulties.
Interestingly, their assessment of the impact on
recall rates generally agrees with actual observations we
made during a review of over 100,000 examinations
interpreted with and without the use of CAD results. In
summary, as a group, the breast imagers practicing in our
facility have incorporated the use of CAD with little concern
that their practice has been substantially altered with
regard to recall rates or time required for the interpretation
of mammograms.

Teleradiology and screening mammography: a telemammography

ABSTRACT

Radiologists’ performance reviewing and rating breast cancer screening mammography exams using a
telemammography system was evaluated and compared with the actual clinical interpretations of the same
examinations. Mammography technologists from three remote imaging sites transmitted to a central site
(radiologists) 245 exams that they (the technologists) believed needed additional procedures (termed “recall”). Current
exam image data and non-image data (i.e., technologist’s text message, technologist’s graphic marks, patient’s prior
report, and Computer Aided Detection (CAD) results) were transmitted to the central site and displayed on three
high-resolution, portrait monitors. Seven radiologists interpreted (“recall” or “no recall”) the exams using the
telemammography workstation in three separate multi-mode studies. The mean telemammography recall rates ranged
from 72.3% to 82.5% while the actual clinical recall rates ranged from 38.4% to 42.3% across the three studies. Mean
Kappa of agreement ranged from 0.102 to 0.213 and mean percent agreement ranged from 48.7% to 57.4% across the
three studies. Eighty-seven percent of the disagreements occurred when the telemammography
interpretation resulted in a recommendation to recall and the clinical interpretation resulted in a recommendation not
to recall. The poor agreement between the telemammography and clinical interpretations may indicate a critical
dependence on images from prior screening exams rather than any text-based information. The technologists were
sensitive, if not specific, to the mammography features and changes that may lead to recall. Using the
telemammography system, the radiologists were able to reduce the recalls recommended by the technologists by
approximately 25 percent.

Keywords:  Teleradiology,  human  performance,  recall  rate,  breast  cancer  screening,  mammography 

1.  INTRODUCTION 

Screening for breast cancer using mammography is and will continue to be practiced worldwide with extensive
research supporting the benefits of screening,1-6 despite sporadic studies reporting limited or no benefit from screening
mammography.7-9 The ubiquitous practice and growing population of candidates for screening mammography present
many challenges to practitioners, creating possible opportunities to improve screening mammography. Some
elements of screening mammography that have the potential to be improved include radiologists’ practice and
performance, personnel shortages, and public perception and compliance.10-16

The practice of teleradiology may potentially improve the management of screening mammography, in particular in
remote or underserved locations where physicians are not physically present, but the high spatial-resolution demands of
mammography present challenges to effective implementation of telemammography-based practices. There are
several image processing techniques commonly used in teleradiology to facilitate handling large amounts of data,
which include image compression, image cropping, image selection, and display format.17-24 In addition, the
qualifications of personnel necessary for the successful implementation may need to be evaluated.25-27 We have
designed and tested a telemammography system capable of handling the large data requirements of mammography that
is operated by mammography technologists at the remote sites and experienced radiologists who review transmitted
examinations at the central site.28-30

In this study, we evaluated radiologists’ performance during off-line reviewing and rating of screening mammography
exams using the telemammography system and compared their performance to the clinical interpretations of the same
examinations. The incremental addition of information was analyzed in three separate multi-mode studies to
determine the independent effect of each type of information. The motivation was to determine what information is
necessary and sufficient in a telemammography system implemented to reduce the number of patients recalled for
additional procedures as part of the screening exam. The long-term objective is to decrease the number of patients
recalled by remote management, particularly in underserved areas.

2.  METHODS 


2.1  Telemammography  system 

Cases used in this retrospective study were accrued using a high-quality, multi-site telemammography system that
consists of one central site (Magee-Womens Hospital, Pittsburgh, PA, USA) and three remote sites (satellite women’s
imaging centers of Magee-Womens Hospital). Cases for this study were acquired under normal operating procedures.
Specific technical details such as software design, image processing, workstation features, and inter-site
communication are described by Drescher et al.30 (2003). The system is described briefly as necessary
below.

In short, a technologist at a remote site digitizes mammographic films, composes a message (text and graphic) to
describe and locate their impression, and scans the patient’s prior report (when available). This information and the
results of a Computer Aided Detection (CAD) scheme used to detect suspicious regions are transmitted to the central site.
At the central site, a radiologist reviews the mammographic image data, the technologist’s message (text and graphic), and
patient reports using a three-monitor custom workstation.

2.1.1  Mammographic  image  digitization  and  processing 

The first step in the image acquisition pipeline at the remote sites is to digitize the mammographic films using
high-resolution laser film digitizers (Lumiscan 85, Eastman Kodak, Rochester, NY, USA) at 50 micron pixel dimensions
and 12-bit grayscale. Next, the digital images are automatically cropped to reduce the non-tissue areas surrounding
the breast, which significantly reduces the image size. A CAD scheme is then executed on the cropped images. Next,
the images are compressed at a ratio of 75:1 using the irreversible (lossy), 9/7 transform, wavelet-based JPEG 2000
method. Finally, the image data are parsed into data packets and encrypted using strong 128-bit Microsoft
Point-to-Point Encryption (MPPE) with Microsoft Challenge Handshake Authentication Protocol (MS-CHAP) version 2. The data
packets and CAD results are transmitted to the central site.
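To give a feel for why cropping and 75:1 compression matter on a 56K modem link, the following is a rough back-of-the-envelope sketch in Python. The 50 micron pixel size and the 75:1 ratio come from the text; the film dimensions, the 16-bit storage of 12-bit samples, the crop fractions, and the function name `film_payload` are illustrative assumptions, and real modem throughput is lower than the nominal rate used here.

```python
# Back-of-the-envelope payload estimate for one digitized film.
FILM_WIDTH_MM = 180.0      # ASSUMPTION: 18 x 24 cm film; size is not given in the text
FILM_HEIGHT_MM = 240.0     # ASSUMPTION
PIXEL_UM = 50              # 50 micron pixels (from the text)
BYTES_PER_PIXEL = 2        # ASSUMPTION: 12-bit samples stored in 16-bit words
COMPRESSION_RATIO = 75     # 75:1 JPEG 2000 (from the text)
MODEM_BITS_PER_S = 56_000  # nominal 56K rate; real POTS throughput is lower


def film_payload(crop_fraction=1.0):
    """Estimate sizes and nominal transfer time for one film.

    crop_fraction is the (hypothetical) fraction of pixels kept after
    the automatic cropping step; 1.0 means no cropping.
    """
    cols = round(FILM_WIDTH_MM * 1000 / PIXEL_UM)
    rows = round(FILM_HEIGHT_MM * 1000 / PIXEL_UM)
    raw_bytes = cols * rows * BYTES_PER_PIXEL * crop_fraction
    compressed_bytes = raw_bytes / COMPRESSION_RATIO
    modem_seconds = compressed_bytes * 8 / MODEM_BITS_PER_S
    return {
        "cols": cols,
        "rows": rows,
        "raw_bytes": raw_bytes,
        "compressed_bytes": compressed_bytes,
        "modem_seconds": modem_seconds,
    }


if __name__ == "__main__":
    for fraction in (1.0, 0.5):
        p = film_payload(fraction)
        print(
            f"crop={fraction:.1f}: {p['cols']} x {p['rows']} px, "
            f"raw {p['raw_bytes'] / 1e6:.1f} MB, "
            f"compressed {p['compressed_bytes'] / 1e6:.2f} MB, "
            f"about {p['modem_seconds']:.0f} s at 56K"
        )
```

Under these assumptions an uncropped film is roughly 35 MB raw and under 0.5 MB after compression, which is what makes a four-film exam practical over a telephone line.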

At the central site, the mammographic image data are decrypted and decompressed. Image display on the workstation
is enhanced through minimal unsharp masking. Look-up table (LUT) values are automatically calculated to aid image
viewing. To reduce the visual effects of cropping, images are restored to full height, but not to full width, by padding
(filling) prior to image display. The CAD results are presented as an overlay on the images with regions suspicious
for masses outlined and regions suspicious for microcalcifications circled.
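The text does not say how the LUT values are calculated. As a minimal sketch, assuming a simple linear window between two gray levels chosen from pixel percentiles, a 12-bit image could be mapped to the 8-bit display range as follows. The function names, the percentile heuristic, and the default percentiles are all hypothetical and are not the method used by the system.

```python
def build_lut(window_low, window_high, in_levels=4096, out_max=255):
    """Linear window/level LUT mapping 12-bit input to 8-bit output.

    Values at or below window_low map to 0, values at or above
    window_high map to out_max, with a linear ramp in between.
    """
    if window_high <= window_low:
        raise ValueError("window_high must exceed window_low")
    span = window_high - window_low
    lut = []
    for value in range(in_levels):
        t = (value - window_low) / span
        t = min(1.0, max(0.0, t))  # clamp to [0, 1]
        lut.append(round(t * out_max))
    return lut


def auto_window(pixels, low_pct=1.0, high_pct=99.0):
    """Pick window limits from percentiles of the pixel values.

    ASSUMPTION: a percentile heuristic stands in for whatever automatic
    calculation the actual system performs.
    """
    ordered = sorted(pixels)
    last = len(ordered) - 1
    low = ordered[int(last * low_pct / 100)]
    high = ordered[int(last * high_pct / 100)]
    return low, high


if __name__ == "__main__":
    sample = list(range(500, 3500, 3))   # stand-in for tissue pixel values
    low, high = auto_window(sample)
    lut = build_lut(low, high)
    displayed = [lut[v] for v in sample]  # apply the LUT by table look-up
    print(low, high, min(displayed), max(displayed))
```

A table look-up like this is cheap to recompute, which is consistent with the workstation also offering manual LUT adjustment when the automatic setting is not acceptable.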

2.1.2  Remote  Site 

The computer at each of the three remote sites is an Athlon 900 machine with a 900 MHz processor and 512 MB of
RAM (Advanced Micro Devices, Sunnyvale, CA, USA) operating under Microsoft Windows 2000 Workstation
(Microsoft Corporation, Redmond, WA, USA). They are equipped with both 56K hardware modems and Ethernet
network cards (Integrated PRO/100 S Desktop Adapter, Intel Corporation, Santa Clara, CA, USA). Sites 1, 2 and 3
are 15, 20, and 15 miles from the central site, respectively. However, we successfully tested the system in the past at a
site located 90 miles from the central site. Sites 1 and 2 transmit data across Plain Old Telephone System (POTS)
lines. Site 3 transmits data across the Local Area Network (LAN).

The technologists scan the patient’s prior report or history using HP Scanjet 5470c scanners (Hewlett-Packard
Company, Palo Alto, CA, USA) that are equipped with automatic document feeders. Prior to transmission to the
central site the reports are converted to one bit per pixel portable network graphic (PNG) images.

2.1.3 Central Site

The central site telemammography workstation is powered by an Athlon MP dual 1.2 GHz multi-processor with 2 GB
of RAM (Advanced Micro Devices, Sunnyvale, CA, USA), which operates under Microsoft Windows 2000 Server
(Fig. 1). The workstation display consists of three high-resolution (2048 x 2560), 8-bit grayscale, portrait monitors at a
nominal setting of 80 ftL: two Dome C5i flat-panel monitors (Planar Systems, Beaverton, OR, USA) for image
display and one Clinton DS5100P cathode ray tube monitor (Clinton Electronics, Rockford, IL, USA) for text. The
workstation communicates with the remote sites via 56K hardware modems (U.S. Robotics, Rolling Meadows, IL,
USA) and Ethernet network cards (OfficeConnect 10/100 NIC, 3COM, Santa Clara, CA, USA).


Fig  1.  Telemammography  workstation  at  the  central  site  in  the  default  viewing  format. 

The  key  display  features  available  on  the  workstation  include  manual  LUT  adjustments,  magnification,  quadrant 
viewing  (images  viewed  one  quadrant  at  a  time),  and  multiple  display  formats,  which  are  all  mouse-driven.  Possible 
image  display  formats  include:  one  image/monitor,  two  images/monitor,  or  four  images/monitor.  The  typical  display 
resolution  was  approximately  100  micron  pixel  dimensions  for  one  image/monitor  and  200  micron  pixel  dimensions 
for  two  images/monitor.  Images  can  be  magnified  by  a  free-moving  magnification  box  or  quadrant  panning.  The 
magnification box size varies depending on the image display format; for one image/monitor the box is 511 x 566
pixels  and  for  two  images/monitor  the  box  is  204  x  266  pixels.  The  left  and  center  monitors  display  the  image  data 
with  the  CAD  results  overlaid,  and  the  right  monitor  displays  the  message  windows,  prior  reports,  case  lists,  etc. 

2.1.4  Inter-site  communication 

The technologists (remote site) and radiologists (central site) communicate effectively using a message window that
features free text and interactive graphic windows and operates in almost real-time (Fig. 2). Typically a message
window is sent with each case with communication performed in one cycle. The technologist sends a message with
each case, and the radiologist responds directly to the message. The message windows at the remote and central sites
both contain five areas: (1) patient demographics, (2) message display area, (3) pull-down menus, (4) interactive
generic image of breast, and (5) free text area. There are five pull-down menus on the technologist message window
to focus communication on possible actionable items that indicate: (1) breast: left or right; (2) view: craniocaudal
and/or mediolateral oblique; (3) finding: mass, calcifications, tissue asymmetry, palpable lump, or nodule; (4)
comparison with prior exam: baseline exam, new finding, slight change, moderate change, or remarkable change; (5)
other findings stable, and (6) possible additional procedure needed: additional views and/or ultrasound. The
interactive generic image of the breast allows the technologists to place an “X” mark precisely on the region of
suspicion. The radiologists can reply after reviewing each case. The response options include: (1) do recommended
procedure as suggested; (2) no additional procedures necessary; and (3) do not do the procedure recommended, but do
X, Y, and Z. If the radiologist recommends additional procedures, the interactive generic image of the breast allows
the radiologist to place a “square” mark precisely on the region that requires the additional work-up.
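The structured part of this exchange can be pictured as two small records, one per direction. The sketch below is only an illustration of the fields listed above; the class names, field names, the normalized mark coordinates, and the validation rule are hypothetical and do not describe the actual software.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class Breast(Enum):
    LEFT = "left"
    RIGHT = "right"


class View(Enum):
    CC = "craniocaudal"
    MLO = "mediolateral oblique"


class Finding(Enum):
    MASS = "mass"
    CALCIFICATIONS = "calcifications"
    TISSUE_ASYMMETRY = "tissue asymmetry"
    PALPABLE_LUMP = "palpable lump"
    NODULE = "nodule"


class PriorComparison(Enum):
    BASELINE_EXAM = "baseline exam"
    NEW_FINDING = "new finding"
    SLIGHT_CHANGE = "slight change"
    MODERATE_CHANGE = "moderate change"
    REMARKABLE_CHANGE = "remarkable change"


class Procedure(Enum):
    ADDITIONAL_VIEWS = "additional views"
    ULTRASOUND = "ultrasound"


class Reply(Enum):
    DO_AS_SUGGESTED = "do recommended procedure as suggested"
    NO_ADDITIONAL_PROCEDURES = "no additional procedures necessary"
    DO_OTHER = "do not do the procedure recommended, but do another"


@dataclass
class TechnologistMessage:
    breast: Breast
    views: List[View]
    finding: Finding
    comparison: PriorComparison
    other_findings_stable: bool
    suggested_procedures: List[Procedure]
    # "X" mark on the generic breast image; normalized (x, y) is an assumption
    mark_xy: Optional[Tuple[float, float]] = None
    free_text: str = ""


@dataclass
class RadiologistReply:
    reply: Reply
    procedures: List[Procedure] = field(default_factory=list)
    # "square" mark on the generic breast image; normalized (x, y) is an assumption
    mark_xy: Optional[Tuple[float, float]] = None
    free_text: str = ""

    def __post_init__(self):
        # Hypothetical rule: an alternative request must say what to do instead.
        if self.reply is Reply.DO_OTHER and not (self.procedures or self.free_text):
            raise ValueError("DO_OTHER needs procedures or free text")


if __name__ == "__main__":
    msg = TechnologistMessage(
        breast=Breast.RIGHT,
        views=[View.MLO, View.CC],
        finding=Finding.CALCIFICATIONS,
        comparison=PriorComparison.MODERATE_CHANGE,
        other_findings_stable=True,
        suggested_procedures=[Procedure.ADDITIONAL_VIEWS],
        free_text="Should I do a Magnification?",
    )
    print(msg)
    print(RadiologistReply(Reply.DO_AS_SUGGESTED))
```

Constraining most fields to fixed choices mirrors the pull-down menus, which keep the one-cycle exchange short and unambiguous while leaving free text for anything unusual.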


[Fig 2 screenshot text. Window fields: Exam Date, Exam Code, Message Status, Unread Messages.
Message timestamps: Wednesday, January 05, 2005 17:16:18 and 17:20:30.]

Technologist@Site91:
Right Breast; MLO - Upper-Anterior; CC - Medial; Current Findings - Calcifications.
Moderate Change compared to prior exam; All Other Findings Stable.
Should I do a Magnification?

Radiologist@Site0:
Ok, do the recommended procedure.

Fig  2.  Remote  site  message  window  used  by  the  technologists  to  communicate  with  the  central  site  (radiologists). 


2.1.5  Telemammography  system  operation 

The system has been operational for more than two years, and to date over 2000 cases have been transmitted. The
image quality, image display, effects of the image processing, and telemammography system features are generally
well received and considered more than adequate for reviewing screening mammography examinations by
participating radiologists. The magnification features provide a detailed review of the breast tissue patterns,
particularly microcalcifications. The automated LUT settings were acceptable in nearly 90% of the cases during
review. The breast tissue was completely retained following the automated cropping, producing images that were
visually appealing for review. There may be some detectable differences at extremely high magnifications between
non-compressed digitized mammographic images and images compressed at a 75:1 ratio, but based on several
assessments these differences should not affect the diagnostic image quality. In a two-alternative forced choice
discrimination experiment, radiologists could not accurately or reliably discriminate between non-compressed images
and those compressed at 50:1 and 75:1 compression levels when displayed side-by-side.24

2.2  Case  selection 

Two hundred and forty-five breast cancer screening mammography exams, acquired using the telemammography
system from three remote women’s imaging centers, were retrospectively evaluated during this study. Registered
mammography technologists from the remote imaging centers transmitted screening exams to the central site
(radiologists) that they (the technologists) believed needed additional imaging procedures. The technologists selected
the exams prospectively and were unaware at the time of selection whether or not the patient would ultimately
undergo additional procedures during the actual clinical interpretation. One hundred and thirty cases were used in
Study 1. Study 2 consisted of 99 cases that were a subset of the cases used in Study 1. One hundred and fifteen
different cases were used in Study 3. The actual, subsequent clinical interpretation categorized each case using the
Breast Imaging Reporting and Data System (BIRADS) (Table 1). The four routine mammographic films acquired at
our centers during a breast cancer screening exam include the left and right craniocaudal views (LCC & RCC), and
left and right mediolateral oblique views (LMLO & RMLO).

Table 1

Distribution of BIRADS categories as a result of clinical interpretation

BIRADS Category      0      1      2      total

Study 1             51     34     45      130
Study 2             38     25     36       99
Study 3             47     41     27      115

2.3  Study  design  and  data  analysis 

This study was composed of three separate multi-mode studies in which information was presented incrementally
across the individual modes (Table 2). All modes were completed during a single reading. Five
components of information were presented during the three studies: (1) four mammographic images; (2)
technologist’s text message detailing the region of suspicion in terms of type of finding (e.g., mass,
microcalcifications), location, comparison to prior exams (when available), and their (the technologists) recommended
additional procedures; (3) patient’s report from the prior mammography exam (when available); (4) technologist’s
graphic marks on a generic breast image to highlight the region of suspicion; and (5) CAD results.

Table 2

Information presented for case interpretation during each mode of the three studies

Study   Mode 1                       Mode 2                         Mode 3

1       mammographic images only     mammographic images &          n/a
                                     technologist’s message

2       mammographic images &        mammographic images,           n/a
        technologist’s message       message, & prior report

3       mammographic images,         mammographic images,           mammographic images,
        technologist’s message, &    message, prior report, &       message, prior report,
        prior report                 technologist’s graphic marks   graphic marks, & CAD

Seven board-certified radiologists specializing in mammography participated as readers in this study. They were
informed of the exam origination and case selection criteria, but not the mix of “recall” and “no-recall” cases. They
reviewed and rated the screening mammography exams on the telemammography workstation and indicated: (1) if
additional procedures were recommended, (2) when appropriate, which breast was involved, and (3) when
appropriate, the specific recommended procedures. During the telemammography interpretation the ratings were
recorded via a computerized scoring form displayed on the workstation monitor using the computer mouse (Fig. 3).
The full functionality of the workstation (e.g., window and level, magnification, quadrant viewing) was available
during case review. The experience of the radiologists ranged from 6 to 33 years with each performing or reading
over 2000 breast imaging procedures per year. Two radiologists participated in all three studies. Two radiologists
participated in Studies 1 and 2. Three radiologists participated only in Study 3. (Reader order was scrambled.)

The performance of the radiologists when using the telemammography system was compared with the actual clinical
interpretation of the same screening mammography examinations. Performance was evaluated in terms of the percent
of exams recommended for recall for additional procedures (termed “recall”), the percent agreement, and the Kappa of
agreement between the two types of interpretation (i.e., telemammography and clinical).
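The three performance measures can all be derived from a reader's 2 x 2 table of telemammography versus clinical decisions. The sketch below assumes the Kappa reported here is Cohen's kappa for two raters, which the text does not state explicitly; the function name and argument order are illustrative. The example counts are those reported for Radiologist 1 in Study 1, mode 1 (Table 3).

```python
def agreement_stats(a, b, c, d):
    """Recall rate, percent agreement, and Cohen's kappa from a 2 x 2 table.

    a: telemammography recall,    clinical recall
    b: telemammography recall,    clinical no-recall
    c: telemammography no-recall, clinical recall
    d: telemammography no-recall, clinical no-recall
    """
    n = a + b + c + d
    if n == 0:
        raise ValueError("empty table")
    recall_rate = (a + b) / n          # telemammography recall rate
    observed = (a + d) / n             # percent agreement, as a fraction
    # Chance agreement from the marginal totals of both interpretations.
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    if expected == 1.0:
        raise ValueError("kappa undefined when chance agreement is 1")
    kappa = (observed - expected) / (1.0 - expected)
    return recall_rate, observed, kappa


if __name__ == "__main__":
    # Radiologist 1, Study 1, mode 1: counts 50, 68, 1, 11 (n = 130)
    recall_rate, observed, kappa = agreement_stats(50, 68, 1, 11)
    print(f"recall rate {recall_rate:.1%}, "
          f"agreement {observed:.1%}, kappa {kappa:.3f}")
```

With these counts the sketch gives a recall rate of 90.8%, agreement of 46.9%, and kappa of 0.097, which matches the tabulated values and supports the Cohen's kappa assumption. Kappa stays low even at moderate raw agreement because a reader who recalls almost every case agrees with the clinical "recall" decisions largely by chance.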


Fig  3.  Scoring  form  used  by  the  radiologists  as  it  would  appear  during  mode  3  of  Study  3. 


3.  RESULTS 


Technologists were able to identify suspicious examinations that may require additional procedures, but their
“recommended” examinations amounted to a substantially larger number compared with that of the actual clinical
interpretation by a radiologist. The percentages of exams recommended for recall for additional procedures (termed
“recall”) during the actual clinical interpretation for Studies 1, 2 (a subset of Study 1), and 3 were 39.2% (51/130),
38.4% (38/99), and 40.9% (47/115), respectively. The screening exams sent by the technologists were those cases
that they (the technologists) believed needed additional imaging procedures to complete the exam. All 245 exams were
successfully transmitted, processed, reviewed, and rated.

The recall rates for all radiologists during the telemammography interpretations in all three multi-mode studies were
significantly higher than those of the actual clinical interpretations (Tables 3-9). As a result, there was poor agreement
between the two interpretation types (telemammography and clinical) for all studies. The majority of disagreement
interpretations resulted when the telemammography interpretation resulted in a recommendation to recall and the
actual clinical interpretation resulted in a recommendation not to recall. There were a total of 1635 disagreement
interpretations across all three multi-mode studies between the telemammography and actual clinical interpretation
and of these disagreements 86.7% (1418/1635) occurred when the telemammography interpretation resulted in recall
and the clinical interpretation resulted in a recommendation not to recall.

In  Study  1,  the  recall  rates  of  all  the  radiologists  during  the  telemammography  interpretations  significantly  increased 
from  mode  1  (images  only)  to  mode  2  (images  and  technologist’s  text  message)  while  the  agreement  between  the 
telemammography  and  actual  clinical  interpretations  decreased  from  mode  1  to  mode  2  for  three  out  of  four 
radiologists  (Tables  3  and  4).  Modes  1  and  2  of  Study  1  had  mean  Kappa  of  0.125  (+/-  0.041)  and  0.102  (+/-0.059), 
respectively,  mean  agreements  of  51.7%  (+/-  5.5)  and  48.7%  (+/-  6.3),  respectively,  and  mean  recall  rates  of  73.3% 
(+/-  17.9)  and  82.5%  (+/-  16.2),  respectively.  The  mean  number  of  disagreement  interpretations  between  the 
telemammography  and  actual  clinical  interpretations  were  62.8  and  66.8  for  modes  1  and  2,  respectively.  The  mean 
percentage  of  these  disagreements  occurring  when  the  telemammography  interpretation  resulted  in  a  recommendation 
to  recall  and  the  clinical  interpretations  resulted  in  a  recommendation  not  to  recall  were  83.9%  (53.5/62.8)  and  91.2% 
(61.5/66.8)  for  modes  1  and  2,  respectively. 
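The percentages quoted with the mean counts (for example 83.9% next to 53.5/62.8) appear to be means of the per-reader percentages rather than the ratio of the two mean counts, which would give a slightly different figure. The following sketch, using the per-radiologist disagreement counts from Tables 3 and 4, illustrates the distinction; this reading of how the authors averaged is an inference, and the function and variable names are illustrative.

```python
# Per radiologist: (telemammography recall / clinical no-recall,
#                   telemammography no-recall / clinical recall)
STUDY1_MODE1 = [(68, 1), (52, 10), (33, 20), (61, 6)]
STUDY1_MODE2 = [(77, 0), (63, 4), (42, 15), (64, 2)]


def summarize(pairs):
    """Summarize disagreements between the two interpretation types."""
    totals = [over + under for over, under in pairs]
    readers = len(pairs)
    mean_disagreements = sum(totals) / readers
    mean_over_recalls = sum(over for over, _ in pairs) / readers
    # Mean of each reader's own percentage (matches the reported figures).
    mean_of_fractions = sum(
        over / total for (over, _), total in zip(pairs, totals)
    ) / readers
    # Ratio of the two mean counts (what the parenthetical suggests).
    ratio_of_means = mean_over_recalls / mean_disagreements
    return {
        "mean_disagreements": mean_disagreements,
        "mean_over_recalls": mean_over_recalls,
        "mean_of_fractions": mean_of_fractions,
        "ratio_of_means": ratio_of_means,
    }


if __name__ == "__main__":
    for label, pairs in (("mode 1", STUDY1_MODE1), ("mode 2", STUDY1_MODE2)):
        s = summarize(pairs)
        print(
            f"{label}: {s['mean_over_recalls']:.1f}/{s['mean_disagreements']:.1f}, "
            f"mean of per-reader fractions {s['mean_of_fractions']:.1%}, "
            f"ratio of means {s['ratio_of_means']:.1%}"
        )
```

For mode 1 the mean of per-reader fractions is 83.9% while the ratio of means is about 85.3%; for mode 2 the two figures are 91.2% and about 92.1%. Either way, the large majority of disagreements are telemammography recalls of cases that were not recalled clinically.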


Table 3

Study 1, mode 1 (images only): telemammography workstation interpretations compared to clinical
interpretations

Telemammography                    Clinical interpretation
recommendations      recall (n = 51)   no-recall (n = 79)   Total (n = 130)   Kappa

Radiologist 1
  recall             38.5% (50)        52.3% (68)           90.8% (118)       0.097
  no-recall           0.8% (1)          8.5% (11)            9.2% (12)

Radiologist 2
  recall             31.5% (41)        40.0% (52)           71.5% (93)        0.127
  no-recall           7.7% (10)        20.8% (27)           28.5% (37)

Radiologist 3
  recall             23.8% (31)        25.4% (33)           49.2% (64)        0.182
  no-recall          15.4% (20)        35.4% (46)           50.8% (66)

Radiologist 4
  recall             34.6% (45)        46.9% (61)           81.5% (106)       0.093
  no-recall           4.6% (6)         13.8% (18)           18.5% (24)

Table 4

Study 1, mode 2 (images and technologist’s text message): telemammography workstation interpretations
compared to clinical interpretations

Telemammography                    Clinical interpretation
recommendations      recall (n = 51)   no-recall (n = 79)   Total (n = 130)   Kappa

Radiologist 1
  recall             39.2% (51)        59.2% (77)           98.5% (128)       0.020
  no-recall           0.0% (0)          1.5% (2)             1.5% (2)

Radiologist 2
  recall             36.2% (47)        48.5% (63)           84.6% (110)       0.103
  no-recall           3.1% (4)         12.3% (16)           15.4% (20)

Radiologist 3
  recall             27.7% (36)        32.3% (42)           60.0% (78)        0.159
  no-recall          11.5% (15)        28.5% (37)           40.0% (52)

Radiologist 4
  recall             37.7% (49)        49.2% (64)           86.9% (113)       0.124
  no-recall           1.5% (2)         11.5% (15)           13.1% (17)

The differences between modes 1 (images and technologist’s text message) and 2 (images, technologist’s text
message, and prior report) of Study 2 were a slight decrease in the recommended recall rates for three of the four
radiologists from mode 1 to mode 2 as well as small changes in agreement across the radiologists between the two
modes (Tables 5 and 6). Modes 1 and 2 of Study 2 had mean Kappa of 0.163 (+/- 0.077) and 0.165 (+/- 0.081),
respectively, mean agreements of 52.3% (+/- 6.7) and 52.8% (+/- 7.0), respectively, and mean recall rates of 79.6%
(+/- 12.3) and 77.5% (+/- 13.8), respectively. The mean numbers of disagreement interpretations between the
telemammography and actual clinical interpretations were 47.3 and 46.8 for modes 1 and 2, respectively. The mean
percentages of these disagreements occurring when the telemammography interpretation resulted in a recommendation
to recall and the clinical interpretation resulted in a recommendation not to recall were 92.5% (44.0/47.3) and 90.8%
(42.8/46.8) for modes 1 and 2, respectively.


Table 5

Study 2, mode 1 (images and technologist’s text message): telemammography workstation interpretations
compared to clinical interpretations

Telemammography     Clinical interpretation
recommendations     recall (n = 38)   no-recall (n = 61)   Total (n = 99)   Kappa

Radiologist 1
  recall            32.3% (32)        36.4% (36)           68.7% (68)       0.219
  no-recall         6.1% (6)          25.3% (25)           31.3% (31)

Radiologist 2
  recall            37.4% (37)        44.4% (44)           81.8% (81)       0.208
  no-recall         1.0% (1)          17.2% (17)           18.2% (18)

Radiologist 3
  recall            32.3% (32)        39.4% (39)           71.7% (71)       0.174
  no-recall         6.1% (6)          22.2% (22)           28.3% (28)

Radiologist 4
  recall            38.4% (38)        57.6% (57)           96.0% (95)       0.051
  no-recall         0.0% (0)          4.0% (4)             4.0% (4)
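
The Kappa column can be recomputed from each radiologist's 2x2 counts. The source does not say which agreement statistic was used; the sketch below assumes the unweighted Cohen's kappa, and the function name `cohen_kappa` and the ordering of the counts are illustrative choices, not taken from the study.

```python
def cohen_kappa(a, b, c, d):
    """Unweighted Cohen's kappa for a 2x2 agreement table.

    a: both readings recall
    b: telemammography recall, clinical no-recall
    c: telemammography no-recall, clinical recall
    d: both readings no-recall
    """
    n = a + b + c + d
    observed = (a + d) / n
    # Chance agreement from the row and column totals.
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    # Undefined when expected == 1 (both readers always give the same answer).
    return (observed - expected) / (1 - expected)


# Counts from Table 5 (Study 2, mode 1).
table5 = {
    "Radiologist 1": (32, 36, 6, 25),
    "Radiologist 2": (37, 44, 1, 17),
    "Radiologist 3": (32, 39, 6, 22),
    "Radiologist 4": (38, 57, 0, 4),
}

for name, counts in table5.items():
    print(name, round(cohen_kappa(*counts), 3))
# Radiologist 1 0.219
# Radiologist 2 0.208
# Radiologist 3 0.174
# Radiologist 4 0.051
```

Under this assumption the computed values match the reported Kappa column to three decimals. Radiologist 4 illustrates why kappa stays low even with no missed clinical recalls: recommending recall for 96% of exams leaves almost no agreement beyond what chance would produce.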

Table 6

Study 2, mode 2 (images, technologist’s text message, and prior report): telemammography workstation
interpretations compared to clinical interpretations

Telemammography     Clinical interpretation
recommendations     recall (n = 38)   no-recall (n = 61)   Total (n = 99)   Kappa

Radiologist 1
  recall            30.3% (30)        34.3% (34)           64.6% (64)       0.206
  no-recall         8.1% (8)          27.3% (27)           35.4% (35)

Radiologist 2
  recall            37.4% (37)        42.4% (42)           79.8% (79)       0.237
  no-recall         1.0% (1)          19.2% (19)           20.2% (20)

Radiologist 3
  recall            31.3% (31)        38.4% (38)           69.7% (69)       0.167
  no-recall         7.1% (7)          23.2% (23)           30.3% (30)

Radiologist 4
  recall            38.4% (38)        57.6% (57)           96.0% (95)       0.051
  no-recall         0.0% (0)          4.0% (4)             4.0% (4)

The differences among modes 1 (images and technologist’s text message), 2 (images, technologist’s text message,
and prior report), and 3 (images, technologist’s text message, prior report, and CAD) in Study 3 were relatively small,
with a slight decrease in Kappa across the three modes (Tables 7, 8 and 9). Modes 1, 2 and 3 of Study 3 had mean
Kappa values of 0.213 (+/- 0.072), 0.206 (+/- 0.060), and 0.201 (+/- 0.061), respectively, mean agreements of 57.4%
(+/- 4.6), 57.1% (+/- 3.9), and 56.7% (+/- 3.9), respectively, and mean recall rates of 72.3% (+/- 9.3), 72.3% (+/- 9.3),
and 72.7% (+/- 9.2), respectively. The mean number of disagreements between the telemammography and actual
clinical interpretations was 49.0, 49.4, and 49.8 for modes 1, 2 and 3, respectively. The mean percentage of these
disagreements in which the telemammography interpretation recommended recall and the clinical interpretation
recommended no-recall was 82.7% (40.6/49.0), 82.4% (40.8/49.4) and 81.9% (40.8/49.8) for modes 1, 2, and 3,
respectively.


Table 7

Study 3, mode 1 (images, technologist’s text message, and prior report): telemammography workstation
interpretations compared to clinical interpretations

Telemammography     Clinical interpretation
recommendations     recall (n = 47)   no-recall (n = 68)   Total (n = 115)  Kappa

Radiologist 1
  recall            38.3% (44)        38.3% (44)           76.5% (88)       0.255
  no-recall         2.6% (3)          20.9% (24)           23.5% (27)

Radiologist 2
  recall            39.1% (45)        46.1% (53)           85.2% (98)       0.152
  no-recall         1.7% (2)          13.0% (15)           14.8% (17)

Radiologist 3
  recall            33.9% (39)        28.7% (33)           62.6% (72)       0.318
  no-recall         7.0% (8)          30.4% (35)           37.4% (43)

Radiologist 4
  recall            30.4% (35)        33.9% (39)           64.3% (74)       0.157
  no-recall         10.4% (12)        25.2% (29)           35.7% (41)

Radiologist 5
  recall            34.8% (40)        38.3% (44)           73.0% (84)       0.182
  no-recall         6.1% (7)          20.9% (24)           27.0% (31)

Table 8

Study 3, mode 2 (images, technologist’s text message, prior report, and technologist’s graphic marks):
telemammography workstation interpretations compared to clinical interpretations

Telemammography     Clinical interpretation
recommendations     recall (n = 47)   no-recall (n = 68)   Total (n = 115)  Kappa

Radiologist 1
  recall            38.3% (44)        38.3% (44)           76.5% (88)       0.255
  no-recall         2.6% (3)          20.9% (24)           23.5% (27)

Radiologist 2
  recall            39.1% (45)        46.1% (53)           85.2% (98)       0.152
  no-recall         1.7% (2)          13.0% (15)           14.8% (17)

Radiologist 3
  recall            33.0% (38)        29.6% (34)           62.6% (72)       0.285
  no-recall         7.8% (9)          29.6% (34)           37.4% (43)

Radiologist 4
  recall            30.4% (35)        33.9% (39)           64.3% (74)       0.157
  no-recall         10.4% (12)        25.2% (29)           35.7% (41)

Radiologist 5
  recall            34.8% (40)        38.3% (44)           73.0% (84)       0.182
  no-recall         6.1% (7)          20.9% (24)           27.0% (31)

Table 9

Study 3, mode 3 (images, technologist’s text message, prior report, technologist’s graphic marks, and
CAD): telemammography workstation interpretations compared to clinical interpretations

Telemammography     Clinical interpretation
recommendations     recall (n = 47)   no-recall (n = 68)   Total (n = 115)  Kappa

Radiologist 1
  recall            38.3% (44)        39.1% (45)           77.4% (89)       0.241
  no-recall         2.6% (3)          20.0% (23)           22.6% (26)

Radiologist 2
  recall            39.1% (45)        46.1% (53)           85.2% (98)       0.152
  no-recall         1.7% (2)          13.0% (15)           14.8% (17)

Radiologist 3
  recall            33.0% (38)        29.6% (34)           62.6% (72)       0.285
  no-recall         7.8% (9)          29.6% (34)           37.4% (43)

Radiologist 4
  recall            30.4% (35)        34.8% (40)           65.2% (75)       0.143
  no-recall         10.4% (12)        24.3% (28)           34.8% (40)

Radiologist 5
  recall            34.8% (40)        38.3% (44)           73.0% (84)       0.182
  no-recall         6.1% (7)          20.9% (24)           27.0% (31)

4.  DISCUSSION 


In this controlled study, the percentage of breast cancer screening mammography exams recommended for additional
procedures (“recall”) by the radiologists, when interpreting exams suspected by technologists to require additional
procedures using the telemammography system, was significantly higher than in the actual clinical interpretations of
the same exams. Adding information other than the mammographic images (i.e., technologist’s text message to
describe suspicious regions, prior patient reports or history, technologist’s graphic marks to highlight suspicious
regions, and CAD) to the telemammography system did not significantly change the radiologists’ interpretations
compared with interpretations based on the mammographic images only. The majority of disagreements occurred
when the telemammography interpretation recommended recall and the clinical interpretation recommended
no-recall.

The significantly higher recall rates (nearly double) when interpreting screening mammography exams using the
telemammography system, as compared to the actual clinical interpretation, are in agreement with our previous
study [29] and similar to those of Elmore et al. [13] (1994). The mean telemammography recall rates ranged from
72.3% to 82.5%, while the actual clinical recall rates ranged from 38.4% to 42.3% across the three separate
multi-mode studies.

This study indicates that technologists are sensitive, if not specific, to the mammography features and changes that
may lead to a recall. There were 245 unique screening mammography exams used in this study, and 60.0% (147/245)
were not recommended for recall for additional procedures during the actual clinical interpretation. Of these 147
exams not recalled, 49.0% (72/147) had a clinical BIRADS category of 2. Therefore, the technologists were able to
detect abnormal findings during the mammography exam, but were not skilled at determining whether the findings
might represent disease.

The radiologists’ high recall rates for additional procedures using the telemammography system suggest a critical
dependence on prior images when making management decisions, although other factors may have had some effect.
Therefore, our telemammography system was recently modified to incorporate images from the patient’s prior
screening exam (when available). The modified system, which includes prior image data, has been tested
operationally, and a clinically simulated study to evaluate the impact of this final modification on recommended
recall rates is underway.

There are several limitations to the current study that may have caused the high recall rate during the
telemammography interpretation. First, the telemammography interpretation in this study constituted a limited review
due to the lack of images from the prior screening mammography exam for comparison. Second, the lack of prior
images, combined with the technologists’ skill at detecting abnormalities in the mammography exams and their
description of the abnormality (e.g., new finding, moderate change in finding), may have influenced the
radiologists to recommend additional procedures to further evaluate the “abnormal” findings suggested by the
technologist. Third, the participating radiologists may have expected an “enriched” sample population because they
knew that this was a laboratory study. A similar explanation for high recall rates was reported in Elmore et al. [13]
(1994). Finally, this retrospective study did not affect clinical management of the patient, so this knowledge may
have prompted the observed over-reading.

The seven experienced radiologists who participated in this study confirmed the feasibility of our telemammography
system to provide remote patient “management” when a physician is not present in the clinic. This applies
particularly to our effort to reduce the number of patients recalled for additional procedures as part of breast cancer
screening mammography by identifying these patients while they remain at the remote clinic, thereby reducing the
patient anxiety associated with recall. The limited information provided to the radiologists (i.e., no images from the
prior exam) still enabled a moderate reduction, of approximately 25 percent, in the number of recalls recommended
by the technologists. Inclusion of the final pertinent information component, mammographic images from the
prior exam, is expected to further reduce the number of recommended procedures and significantly improve the
agreement between the telemammography and actual clinical interpretations.